Query Expansion Methods and Performance Evaluation for Reusing Linking Open Data of the European Public Procurement Notices

Code: TSI-020100-2010-919

José María Álvarez Rodríguez
WESO, Universidad de Oviedo
http://purl.org/weso/moldeas/

Tecnologías de Linked Data y sus aplicaciones en España (TLDE)
CAEPIA 2011, Tenerife (Spain)

8th of November, 2011

Overview

Use case & Context

SPARQL & Performance

Next Steps

Objective

Creation of a pan-European e-procurement platform

Covering almost all public procurement notices of the European regions

E-procurement Long Tail

TED

BOE (official bulletin of the Spanish Government)

BOPA (official bulletin of the Asturian Government)

To be able to answer …

Which public procurement notices are relevant to Dutch companies (SMEs only) that want to tender for contracts announced by local authorities, with a total value lower than 170K € and a two-year duration, to procure “Road bridge construction work” in the Dutch-speaking region of Flanders (Belgium)?

[Figure: architecture diagram. Labels: XML; TED, BOE, BOPA, …; RDFizing; CPV, NUTS, Eurovoc, Organizations; Semantic Methods; Services (e.g. Searching, Matchmaking & Prediction); Linked Data API (Pubby + Snorql); step markers 1 to 5.]

1. Structuring public procurement notices
2. Transforming government classifications
3. LOD enrichment
4. Providing new semantic-based services
5. Easing access to the published data using the LOD approach

1, 2, 3 Preliminary Results

Information                                      Triples
Common Procurement Vocabulary (2003 and 2008)    ~300,00
Organizations                                    ~5,000,000
NUTS                                             36,219
Public procurement notices (2008-2011)           677,058; 2,398,601; 2,590,880; 402,264
Total                                            ~11 million RDF triples

4 Semantic-based Services

Problem of «Query Expansion», depending on the kind of information variable

4 Methods of «Query Expansion»

Expansion

Individual

Taxonomy-based

Directly

Syntactic Search

Spreading Activation

Recommending engine

Location

Georeasoning

User-based

Numeric

Fuzzy Logic

History-based

Correlation

Group

Recommending engine

Remembering …

Which public procurement notices are relevant to Dutch companies (SMEs only) that want to tender for contracts announced by local authorities, with a total value lower than 170K € and a two-year duration, to procure “Road bridge construction work” in the Dutch-speaking region of Flanders (Belgium)?

Query …

[Figure: query graph around ?ppn]
ppn:nutsCode: NUTS-B3 (300 RÉG. WALLONNE)
cpv:CodeIn2008: cpv:45221111-3
org:classification: SME
ppn:hasAmount: 170,000 €
ppn:hasDuration: 2 years
Node label also shown: NL

Applying Query Expansion …

[Figure: expanded query graph around ?ppn]
ppn:nutsCode: NUTS-B3, NUTS-NL326, NUTS-1025, NUTS-BE2
cpv:CodeIn2008: cpv:45221111-3, cpv:45221110-6, cpv:45221113-7, cpv:45221114-4
org:classification: SME
ppn:hasAmount: 170,000-200,000 €
ppn:hasDuration: 2-3 years
Node label also shown: NL
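The expansion step shown on this slide can be sketched in Python. This is only an illustration under stated assumptions: the `RELATED_CPV` map, the `Query` class, and the fixed `amount_slack` and `duration_slack` values are hypothetical, and a plain lookup plus a fixed margin stand in for the taxonomy-based and fuzzy-logic methods named earlier. The CPV codes and numeric ranges are the ones shown on the slide.

```python
from dataclasses import dataclass

# Hypothetical toy fragment of the CPV taxonomy. The four codes come from the
# slide; the "related" relation between them is assumed for illustration.
RELATED_CPV = {
    "45221111-3": ["45221110-6", "45221113-7", "45221114-4"],
}


@dataclass
class Query:
    cpv_codes: list
    amount: tuple    # (min, max) in euros
    duration: tuple  # (min, max) in years


def expand(query, related=RELATED_CPV, amount_slack=30_000, duration_slack=1):
    """Return a widened copy of `query`: more CPV codes, wider numeric ranges."""
    codes = list(query.cpv_codes)
    for code in query.cpv_codes:
        for other in related.get(code, []):
            if other not in codes:
                codes.append(other)
    amount_min, amount_max = query.amount
    duration_min, duration_max = query.duration
    return Query(
        codes,
        (amount_min, amount_max + amount_slack),
        (duration_min, duration_max + duration_slack),
    )


original = Query(["45221111-3"], (170_000, 170_000), (2, 2))
expanded = expand(original)
print(expanded)
```

With these assumed margins the expanded query covers four CPV codes, 170,000-200,000 € and 2-3 years, matching the values in the figure.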

4 Example of SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER (?cpvCode = cpv:45221111-3 ... )
  FILTER ((xsd:double(?amount) >= xsd:long(170000)) && (xsd:double(?amount) <= xsd:long(200000)))
  FILTER (?nutsCode = nuts:B3 ... )
  FILTER ((xsd:long(?duration) >= xsd:long(2)) && (xsd:long(?duration) <= xsd:long(3)))
}

Context

Performance of SPARQL Queries

~30 sec.

Hardware & Software

DELL PC, 2 GB RAM and 30 GB hard disk

VirtualBox (version 4.0.6)

Linux 2.6.35-22-server #33-Ubuntu 2 SMP x86_64 GNU/Linux

Ubuntu 10.10

OpenLink Virtuoso Opensource-6-20110218

Question?

How can query execution time be decreased without modifying the hardware or using any vendor-specific feature?

Triple store

25 graphs, 20 M RDF triples

But…

8 graphs, 11 M RDF triples

Focus on..

The generation of SPARQL queries

Let’s start …

9 SPARQL Queries

3 executions

Ti  Simple  Enhanced  LIMIT  FILTER  GRAPHS  Split  Parallel  Total queries

T1 * 1
T2 * * 1
T3 * 1
T4 * * 1
T5 * * * 1
T6 * * * * * 4
T6-1 * * * * * * 4
T7 * * * * 5
T7-1 * * * * * 5
T8 * * * * * 20
T8-1 * * * * * * 20
T9 * * * * * 15
T9-1 * * * * * 15
T10 * * * * * 60
T10-1 * * * * * * 60
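The "Total queries" column appears to follow from multiplying the number of splits along each dimension. The sketch below is an assumption-laden illustration: the function name and the per-test tuples are read off the later slides (5 CPV codes, 3 NUTS codes, 4 graphs), not taken from the author's code.

```python
def total_queries(cpv_groups, nuts_groups, graphs):
    """Number of simple queries produced when splitting along each dimension."""
    return cpv_groups * nuts_groups * graphs


# (cpv_groups, nuts_groups, graphs) per test; 1 means "not split".
strategies = {
    "T6": (1, 1, 4),   # split by named graph only
    "T7": (5, 1, 1),   # one query per CPV code
    "T8": (5, 1, 4),   # per CPV code and per graph
    "T9": (5, 3, 1),   # per (CPV code, NUTS code) pair
    "T10": (5, 3, 4),  # per pair and per graph
}

totals = {name: total_queries(*dims) for name, dims in strategies.items()}
print(totals)
```

The parallel variants (T6-1 to T10-1) run the same number of queries as their sequential counterparts.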

Simple SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  FILTER (?cpvCode = cpv:15331137)
  FILTER (?nutsCode = nuts:UK)
}

Results T1

Simple query
1 CPV code, 1 NUTS code
Time: ~3.29 sec.

T2

Rewrite SPARQL queries:
Match triple patterns from specific to general
Filter as soon as possible

Use the LIMIT clause
Value set to 10,000

Rewritten SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn cpv:codeIn2008 ?cpvCode .
  FILTER (?cpvCode = cpv:15331137)
  ?ppn ppn:nutsCode ?nutsCode .
  FILTER (?nutsCode = nuts:UK)
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
} LIMIT 10000
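Since the focus is on generating SPARQL queries, the rewriting rules (filter right after the pattern that binds the variable, then add LIMIT) could be applied by a small generator. A minimal sketch in Python, assuming a hypothetical `build_query` helper; prefix declarations are omitted, as on the slides.

```python
def build_query(cpv_codes, nuts_codes, limit=10000):
    """Build a query that filters each variable right after binding it."""

    def any_of(var, values):
        # SPARQL 1.0-style disjunction: ?v = a || ?v = b || ...
        return " || ".join(f"{var} = {value}" for value in values)

    lines = [
        "SELECT DISTINCT * WHERE {",
        "  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .",
        "  ?ppn cpv:codeIn2008 ?cpvCode .",
        f"  FILTER ({any_of('?cpvCode', cpv_codes)})",
        "  ?ppn ppn:nutsCode ?nutsCode .",
        f"  FILTER ({any_of('?nutsCode', nuts_codes)})",
        "  ?ppn ppn:hasDuration ?duration .",
        "  ?ppn dc:identifier ?id .",
        "  ?ppn dc:date ?date .",
        "  ?ppn ppn:hasAmount ?amount .",
        f"}} LIMIT {limit}",
    ]
    return "\n".join(lines)


query = build_query(["cpv:15331137"], ["nuts:UK"])
print(query)
```

Called with one CPV code and one NUTS code, it reproduces the shape of the rewritten query above; with several codes it yields an enhanced query.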

Results T2

1 CPV code, 1 NUTS code

Time: ~3.26 sec.

Evaluation

There are no significant changes in execution time or gain … and we are interested in “enhanced queries”

T3

Execution of enhanced queries

Enhanced SPARQL query

SELECT DISTINCT * WHERE {
  ?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> .
  ?ppn ppn:nutsCode ?nutsCode .
  ?ppn cpv:codeIn2008 ?cpvCode .
  ?ppn ppn:hasDuration ?duration .
  ?ppn dc:identifier ?id .
  ?ppn dc:date ?date .
  ?ppn ppn:hasAmount ?amount .
  # The slide's {a, b, ...} set notation is written here with IN (SPARQL 1.1).
  FILTER (?cpvCode IN (cpv:15331137, cpv:48611000, cpv:48611000, cpv:50531510, cpv:15871210))
  FILTER (?nutsCode IN (nuts:B3, nuts:PL, nuts:RO))
}

Results T3

5 CPV codes, 3 NUTS codes
1 query
Time: ~20.65 sec.

T4

Rewrite SPARQL queries +
Use the LIMIT clause

5 CPV codes, 3 NUTS codes
1 query

Results T4 (with respect to T3)

Time: ~20.55 sec.

Info

8 graphs

11 M RDF triples

T5

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM)

5 CPV codes, 3 NUTS codes
1 query

Results T5 (with respect to T3)

Time: ~20.65 sec.

T6

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split into simple queries

5 CPV codes, 3 NUTS codes
4 graphs, 4 simple queries

Results T6 (with respect to T3)

Time: ~20.60 sec.

T6-1

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split enhanced query into simple queries +
Parallelization of query execution (ad-hoc map/reduce)

5 CPV codes, 3 NUTS codes
4 graphs, 4 simple queries

Results T6-1 (with respect to T3)

Time: ~11.93 sec.
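The "ad-hoc map/reduce" parallelization is not detailed on the slides. One plausible reading, sketched here in Python, runs each simple query concurrently (map) and merges the partial result sets while dropping duplicates (reduce). The `run_query` callable and the `fake_endpoint` stand-in are hypothetical; in practice `run_query` would send the query to the SPARQL endpoint.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(queries, run_query, max_workers=4):
    """Map: run every simple query concurrently. Reduce: merge rows, no duplicates."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `queries`.
        partial_results = list(pool.map(run_query, queries))

    merged, seen = [], set()
    for rows in partial_results:
        for row in rows:
            if row not in seen:
                seen.add(row)
                merged.append(row)
    return merged


def fake_endpoint(query):
    """Stand-in for a SPARQL endpoint: canned rows per query (illustration only)."""
    canned = {
        "q1": [("ppn:1",), ("ppn:2",)],
        "q2": [("ppn:2",), ("ppn:3",)],
    }
    return canned[query]


result = run_parallel(["q1", "q2"], fake_endpoint)
print(result)
```

Threads fit this case because the work is I/O-bound: each worker mostly waits for the triple store to answer.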

T7

Rewrite SPARQL queries +
Use the LIMIT clause +
Split enhanced query into simple queries

1 CPV code per query (5 codes), 3 NUTS codes
5 simple queries

Results T7 (with respect to T3)

Time: ~15.81 sec.

T7-1

Rewrite SPARQL queries +
Use the LIMIT clause +
Split enhanced query into simple queries +
Parallelization of query execution (ad-hoc map/reduce)

1 CPV code per query (5 codes), 3 NUTS codes
5 simple queries

Results T7-1 (with respect to T3)

Time: ~10.55 sec.

T8

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split into simple queries

1 CPV code per query (5 codes), 3 NUTS codes
4 graphs, 20 simple queries

Results T8 (with respect to T3)

Time: ~32.34 sec.

T8-1

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split enhanced query into simple queries +
Parallelization of query execution (ad-hoc map/reduce)

1 CPV code per query (5 codes), 3 NUTS codes
4 graphs, 20 simple queries

Results T8-1 (with respect to T3)

Time: ~18.45 sec.

T9

Rewrite SPARQL queries +
Use the LIMIT clause +
Split enhanced query into simple queries (1 CPV code + 1 NUTS code)

1 CPV code per query (5 codes), 1 NUTS code per query (3 codes)
15 simple queries

Results T9 (with respect to T3)

Time: ~22.62 sec.
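Splitting an enhanced query into one simple query per (CPV code, NUTS code) pair amounts to a Cartesian product. The Python sketch below illustrates this under assumptions: the CPV identifiers and graph URIs are placeholders, the helper names are hypothetical, and only the NUTS codes are taken from the enhanced query shown earlier.

```python
from itertools import product


def split_queries(cpv_codes, nuts_codes, graphs=(None,)):
    """Return one (graph, cpv, nuts) combination per simple query."""
    return list(product(graphs, cpv_codes, nuts_codes))


def to_sparql(graph, cpv, nuts, limit=10000):
    """Render one combination as a simple query (prefixes omitted)."""
    from_clause = f"FROM <{graph}> " if graph else ""
    return (
        f"SELECT DISTINCT * {from_clause}WHERE {{ "
        "?ppn rdf:type <http://purl.org/weso/ppn/def#ppn> . "
        "?ppn cpv:codeIn2008 ?cpvCode . "
        f"FILTER (?cpvCode = {cpv}) "
        "?ppn ppn:nutsCode ?nutsCode . "
        f"FILTER (?nutsCode = {nuts}) "
        f"}} LIMIT {limit}"
    )


# Placeholder CPV identifiers and graph URIs (hypothetical).
cpv_codes = [f"cpv:code{i}" for i in range(1, 6)]
nuts_codes = ["nuts:B3", "nuts:PL", "nuts:RO"]
graphs = [f"http://example.org/graph/{i}" for i in range(1, 5)]

pairs_only = split_queries(cpv_codes, nuts_codes)
with_graphs = split_queries(cpv_codes, nuts_codes, graphs)
print(len(pairs_only), len(with_graphs))
print(to_sparql(*with_graphs[0]))
```

Five CPV codes and three NUTS codes give 15 simple queries; adding four named graphs gives 60.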

T9-1

Rewrite SPARQL queries +
Use the LIMIT clause +
Split enhanced query into simple queries (1 CPV code + 1 NUTS code) +
Parallelization of query execution (ad-hoc map/reduce)

1 CPV code per query (5 codes), 1 NUTS code per query (3 codes)
15 simple queries

Results T9-1 (with respect to T3)

Time: ~12.77 sec.

T10

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split into simple queries (1 CPV code + 1 NUTS code)

1 CPV code per query (5 codes), 1 NUTS code per query (3 codes)
4 graphs, 60 simple queries

Results T10 (with respect to T3)

Time: ~71.63 sec.

T10-1

Rewrite SPARQL queries +
Use the LIMIT clause +
Named graphs (FROM) +
Split enhanced query into simple queries (1 CPV code + 1 NUTS code) +
Parallelization of query execution (ad-hoc map/reduce)

1 CPV code per query (5 codes), 1 NUTS code per query (3 codes)
4 graphs, 60 simple queries

Results T10-1 (with respect to T3)

Time: ~35.13 sec.

Table of Results

Ti       Time (sec.)   Gain (%)
T1        3.29          N/A
T2        3.26          0.93
T3       20.65          N/A
T4       20.55          0.49
T5       20.65          0
T6       20.60          0.24
T6-1     11.93         73.09
T7       15.81         30.61
T7-1     10.55         95.73
T8       32.34        -36.15
T8-1     18.45         11.92
T9       22.62         -8.71
T9-1     12.77         61.71
T10      71.63        -71.17
T10-1    35.13        -41.22
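The slides do not state how the Gain column is computed. The T3-based rows checked below are consistent with measuring the saved time against the test's own time, i.e. (T3 - Ti) / Ti × 100. The Python sketch below checks that reading against the table; the formula is an inference, not the author's stated definition.

```python
def gain(baseline, time):
    """Relative gain in percent, measured against the test's own time."""
    return (baseline - time) / time * 100


T3 = 20.65  # baseline: enhanced query without optimizations (sec.)

times = {
    "T4": 20.55,
    "T6-1": 11.93,
    "T7": 15.81,
    "T7-1": 10.55,
    "T8": 32.34,
    "T10-1": 35.13,
}

gains = {name: round(gain(T3, t), 2) for name, t in times.items()}
for name, value in gains.items():
    print(f"{name}: {value:+.2f} %")
```

Under this definition a value above 0 means the test ran faster than T3, and T7-1 (about 95.73 %) took roughly half the time of T3.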

Discussion

• The number of queries is a key factor
• More CPV codes imply more execution time
• Parallelization improves execution time
• T7-1 is the best execution in terms of time:
  • Rewrite SPARQL queries
  • Use the LIMIT clause
  • Split enhanced query into simple queries
  • Parallelization of query execution

Further Steps

• Distribute graphs across different nodes (HW improvement)
• Use other triple stores (SW comparison)
• Add new SPARQL 1.1 features (expressiveness improvement)
• Cache queries (SW improvement)

