entity matching for semistructured data in the cloud

43
Entity Matching for Semistructured Data in the Cloud Marcus Paradies ACM SAC 2012 - CC Track March 27, 2012 1 / 19 Entity Matching for Semistructured Data in the Cloud Marcus Paradies N

Upload: marcus-paradies

Post on 11-May-2015

576 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Entity Matching for Semistructured Data in the Cloud

Entity Matching for Semistructured Datain the Cloud

Marcus ParadiesACM SAC 2012 - CC Track

March 27, 2012

1 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 2: Entity Matching for Semistructured Data in the Cloud

Outline

1 Motivation

2 ChuQL

3 Entity Matching

4 MAXIM: Entity Matching in the Cloud

5 Summary

2 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 3: Entity Matching for Semistructured Data in the Cloud

Motivation

Enriching/Improving Wikipedia

References from Wikipedia article Hash join

3 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 4: Entity Matching for Semistructured Data in the Cloud

Motivation

Enriching/Improving Wikipedia

Lookup in the CiteSeer database

3 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 5: Entity Matching for Semistructured Data in the Cloud

Motivation

Enriching/Improving Wikipedia

Lookup in Google

3 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 6: Entity Matching for Semistructured Data in the Cloud

Motivation

Wikipedia in a nutshell

Characteristics

3.7 Mio articles (english Wikipedia database)

Dataset size about 30GB of XML (without history)

3.6 Mio references

References are categorized into books, journals, websites, etc.

Challenges

Articles in Wikipedia are incomplete

Articles in Wikipedia are inaccurate

Articles in Wikipedia are subjective

4 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 7: Entity Matching for Semistructured Data in the Cloud

Motivation

Wikipedia in a nutshell

Characteristics

3.7 Mio articles (english Wikipedia database)

Dataset size about 30GB of XML (without history)

3.6 Mio references

References are categorized into books, journals, websites, etc.

Challenges

Articles in Wikipedia are incomplete

Articles in Wikipedia are inaccurate

Articles in Wikipedia are subjective

4 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 8: Entity Matching for Semistructured Data in the Cloud

Motivation

Problem Statement

Definition

Given two datasets of records, R and S , a set of attributesa1, . . . , an, a set of similarity functions sima1 , . . . , siman and asimilarity threshold τ , the task between R and S is defined asfinding and combining all pairs of records from R and S where∑n

i=1 simai (R.ai , S .ai ) ≥ τ

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

Wikipedia Data set CiteSeer Data set

5 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 9: Entity Matching for Semistructured Data in the Cloud

Motivation

Problem Statement

Definition

Given two datasets of records, R and S , a set of attributesa1, . . . , an, a set of similarity functions sima1 , . . . , siman and asimilarity threshold τ , the task between R and S is defined asfinding and combining all pairs of records from R and S where∑n

i=1 simai (R.ai , S .ai ) ≥ τ

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

Wikipedia Data set CiteSeer Data set

5 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 10: Entity Matching for Semistructured Data in the Cloud

ChuQL

6 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 11: Entity Matching for Semistructured Data in the Cloud

ChuQL

ChuQL by example

Wordcount in ChuQL

1 mapreduce {

2 input { fn:collection ("hdfs :// wiki /") }

3 rr { for $rev in $hxml:in// revision

4 return {"key": fn:data($x// username|$x//ip),

5 "val": $x//title } }

6 map { $hxml:in }

7 reduce { {"key": $hxml:in=>"key", "value": fn:count($hxml:in=>"val ")} }

8 rw { <author name ="{ $hxml:in=>"key"}" count ="{ $hxml:in=>"val"}"/> }

9 output { fn:put("hdfs :// outputdir /") }

10 }

7 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 12: Entity Matching for Semistructured Data in the Cloud

Entity Matching

8 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 13: Entity Matching for Semistructured Data in the Cloud

Entity Matching

What is Entity Matching?

Challenges

Entity Matching has quadratic runtime behavior

Entity Matching has high CPU- and memory demands

The definition of “what is similar” is domain-dependent

9 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 14: Entity Matching for Semistructured Data in the Cloud

Entity Matching

What is Entity Matching?

Challenges

Entity Matching has quadratic runtime behavior

Entity Matching has high CPU- and memory demands

The definition of “what is similar” is domain-dependent

9 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 15: Entity Matching for Semistructured Data in the Cloud

Entity Matching

Entity Matching Architecture

Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...

MatchingMatching Match Result

R

Match Result

R

10 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 16: Entity Matching for Semistructured Data in the Cloud

Entity Matching

Entity Matching Architecture

Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...

MatchingMatching Match Result

R

Match Result

R

How can we improve the runtime of an EM task?

10 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 17: Entity Matching for Semistructured Data in the Cloud

Entity Matching

Entity Matching Architecture

Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...

MatchingMatching Match Result

R

Match Result

R

Distributed Blocking

10 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 18: Entity Matching for Semistructured Data in the Cloud

Entity Matching

Entity Matching Architecture

Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...

MatchingMatching Match Result

R

Match Result

R

Distributed Blocking Parallel Matching

10 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 19: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matchingin the Cloud

11 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 20: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Requirements and Approach

Requirements

Efficient processing of semistructured data

Scalability to large datasets

Independency from specific similarity functions

Ability to easily add new similarity functions

Main Idea

Use MapReduce and ChuQL to process semistructured data

Use a search-based blocking to generate candidate pairs

Apply similarity functions to candidate pairs within a block

12 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 21: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Requirements and Approach

Requirements

Efficient processing of semistructured data

Scalability to large datasets

Independency from specific similarity functions

Ability to easily add new similarity functions

Main Idea

Use MapReduce and ChuQL to process semistructured data

Use a search-based blocking to generate candidate pairs

Apply similarity functions to candidate pairs within a block

12 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 22: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Architecture

Node 1

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

Search Engine

Node 2

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node N

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

...

HDFSHDFS

Architecture

Hadoop cluster with up to 40 nodes

Each node runs a search engine and an attached full-text index

Each node runs an in-memory XQuery processor

Semistructured data is partitioned and placed on HDFS

13 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 23: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Processing Stages

Search EnginesSearch Engines

HDFSHDFS

Three Stages

Preparation Stage

Blocking Stage

Matching Stage

14 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 24: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Processing Stages

Extract Wikipedia references

Extract Wikipedia references

Index CiteSeerX records

Index CiteSeerX records

Search EnginesSearch Engines

HDFSHDFS

Extract references

Store references

Transform into full-text index XML

Build index

Preparation Stage

Stage 1: Preparation Stage

Extracts references from Wikipedia

Reads and transforms records from CiteSeerX

Sends CiteSeerX data to local full-text index

14 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 25: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Processing Stages

Extract Wikipedia references

Extract Wikipedia references

Index CiteSeerX records

Index CiteSeerX records

Generate SemanticBlock

Generate SemanticBlock

Search EnginesSearch Engines

HDFSHDFS

Extract references

Store references

Transform into full-text index XML

Build index

Retrieve references

Generate query

Get query response

Store blocks

Preparation Stage Blocking Stage

Stage 2: Blocking Stage

Reads extracted references from HDFS

Probes full-text index to retrieve candidate publications

Assign candidate publications to block(s)

14 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 26: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Processing Stages

Extract Wikipedia references

Extract Wikipedia references

Index CiteSeerX records

Index CiteSeerX records

Generate SemanticBlock

Generate SemanticBlock Record pair generationRecord pair generation

Search EnginesSearch Engines

HDFSHDFS

Extract references

Store references

Transform into full-text index XML

Build index

Retrieve references

Generate query

Get query response

Store blocks

Verify candidatepairs

Store record pairs

Preparation Stage Blocking Stage Matching Stage

Stage 3: Matching Stage

Read blocks from HDFS

Generate candidate pairs and apply similarity functions

Store matching pairs and their similarity

14 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 27: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

Indexing Publications

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 28: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

Extraction

Indexing Publications

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 29: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

<reference type=“journal“> <author1>Hansjörg Zeller</author1> <author2>Jim Gray</author2> <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title> <journal>Proceedings of the 16th VLDB conference</journal> <year>1990</year> <pages>186–197</pages></reference>

Extraction

Transformation

Indexing Publications

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 30: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

<reference type=“journal“> <author1>Hansjörg Zeller</author1> <author2>Jim Gray</author2> <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title> <journal>Proceedings of the 16th VLDB conference</journal> <year>1990</year> <pages>186–197</pages></reference>

Extraction

Transformation

Indexing Publications

HDFS

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 31: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

<reference type=“journal“> <author1>Hansjörg Zeller</author1> <author2>Jim Gray</author2> <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title> <journal>Proceedings of the 16th VLDB conference</journal> <year>1990</year> <pages>186–197</pages></reference>

Extraction

Transformation

Indexing Publications

<doc> <field name="id">10.1.1.49.2550</field> <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field> <field name="author">Bonnie Dorr</field> <field name="description">Generating language ...</field> </doc>

HDFS

Read and Transformation

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 32: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 1: Preparation Stage

Extracting References

{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

<reference type=“journal“> <author1>Hansjörg Zeller</author1> <author2>Jim Gray</author2> <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title> <journal>Proceedings of the 16th VLDB conference</journal> <year>1990</year> <pages>186–197</pages></reference>

Extraction

Transformation

Indexing Publications

<doc> <field name="id">10.1.1.49.2550</field> <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field> <field name="author">Bonnie Dorr</field> <field name="description">Generating language ...</field> </doc>

LuceneIndex

LuceneIndex

HDFS

Read and Transformation

Indexing

15 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 33: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 2: Blocking Stage

Block generation

Each reference generates a set of candidate publications

Each candidate publication is inserted into all blocks, which arelisted in reference

16 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 34: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 2: Blocking Stage

Block generation

Each reference generates a set of candidate publications

Each candidate publication is inserted into all blocks, which arelisted in reference

Example

<citation> <id>26334893</id> <cat>Search engine optimization</cat> <cat>Internet search algorithms</cat> <cat>Link analysis</cat> <ref> <type>journal</type> <author>Taher Haveliwala</author> <year>2003</year> <pages>56-70</pages> <title>The Second Eigenvalue

of the Google Matrix</title> <journal>Stanford University

Technical Report</journal> </ref></citation>

<citation> <id>26334893</id> <cat>Search engine optimization</cat> <cat>Internet search algorithms</cat> <cat>Link analysis</cat> <ref> <type>journal</type> <author>Taher Haveliwala</author> <year>2003</year> <pages>56-70</pages> <title>The Second Eigenvalue

of the Google Matrix</title> <journal>Stanford University

Technical Report</journal> </ref></citation>

<citation> <id>26334893</id> <cat>Hashing</cat> <cat>Join algorithms</cat> <ref> <type>journal</type> <author>Hansjörg Zeller</author> <author>Jim Gray</author> <year>1990</year> <pages>186-197</pages> <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title> <journal>Proceedings of the 16th VLDB

conference</journal> </ref></citation>

Full-Text Index

Search Engine

Hashing

Join algorithms

10.0.1.1.124

10.0.1.1.124

10.0.7.23.14

10.0.1.11.23

send result

send result

send query

16 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 35: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 2: Blocking Stage

Distributed Search in MAXIM

(a)(a

)(a)

(a)

(b) (b) (b) (b)

Node 2

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node 3

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node 4

Task Tracker

ChuQL Engine

Data NodeH

adoo

p

Full-textIndex

SearchEngine

Node 5

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

Search Engine

Node 1

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

(c) Collect partial results

(b) HTTP response (partial result)

(a) Send HTTP request (query) (c)

16 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 36: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 3: Matching Stage

Applies user-defined similarity functions to candidate pairs

Each attribute can be evaluated by a specific similarity function

Number of candidate pairs

CP =n∑

i=1

Ci ∗ Ri (1)

n - # of blocks in B1, . . . ,Bn

Ri - # of references in block Bi

Ci - # of candidate publications in block Bi

CP - # of candidate pairs to verify

17 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 37: Entity Matching for Semistructured Data in the Cloud

MAXIM: Entity Matching in the Cloud

Stage 3: Matching Stage

Applies user-defined similarity functions to candidate pairs

Each attribute can be evaluated by a specific similarity function

Number of candidate pairs

CP =n∑

i=1

Ci ∗ Ri (1)

n - # of blocks in B1, . . . ,Bn

Ri - # of references in block Bi

Ci - # of candidate publications in block Bi

CP - # of candidate pairs to verify

17 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 38: Entity Matching for Semistructured Data in the Cloud

Summary

Summary

Wikipedia provides many opportunities for research

Need for efficiently processing semistructured data is increasing

Entity Matching is critical for data integration and data cleaning

Entity Matching is difficult to parallelize due to unbalanced datapartitions

MAXIM parallelizes EM by building blocks of similar records in aclassification fashion

MAXIM allows to define own similarity functions and computationfunctions without changing the algorithm

18 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 39: Entity Matching for Semistructured Data in the Cloud

“Everything that can be invented has been invented.”

(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)

19 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 40: Entity Matching for Semistructured Data in the Cloud

Experiments

Scaleup and Speedup

1

2

3

4

5

6

7

8

9

5 10 20 40

Spe

edup

= B

ase

Tim

e / N

ew T

ime

Number of nodes

IdealINDEXING-2000

EXTRACTING-2000BLOCKINGMATCHING

(a) Speedup for all stages

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

5 10 20 40S

cale

up =

Bas

e T

ime

/ New

Tim

e

Number of nodes

IdealEXTRACTING-2000

INDEXING-2000

(b) Scaleup for preparation stage

20 / 23Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 41: Entity Matching for Semistructured Data in the Cloud

Experiments

Query Performance

0

100

200

300

400

500

600

700

800

900

5 10 20 40

Avg

. Que

ry R

espo

nse

Tim

e (m

s)

Number of Nodes

RESULTCOUNT-50RESULTCOUNT-100RESULTCOUNT-150RESULTCOUNT-200

Figure: Query Performance for different result set sizes and cluster sizes.

21 / 23Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 42: Entity Matching for Semistructured Data in the Cloud

Experiments

Blocking Accuracy

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

0 0.25 0.5 0.75 1.0

Acc

urac

y

Variance

IdealWRONG-ORDER

MISPLACED-ENDMISPLACED-ANY

MISSING

Figure: Blocking accuracy for different typographical error classes.

22 / 23Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Page 43: Entity Matching for Semistructured Data in the Cloud

Experiments

Number of Candidate Pairs

0

500000

1e+006

1.5e+006

2e+006

2.5e+006

3e+006

3.5e+006

4e+006

4.5e+006

5e+006

5.5e+006

0.0 0.1 0.25 0.5 0.75 1.0

Num

ber

of c

andi

date

pai

rs

Variance

RSCOUNT-50RSCOUNT-100RSCOUNT-150RSCOUNT-200

Figure: Number of candidate pair verifications in the matching stage.

23 / 23Entity Matching for Semistructured Data in the CloudMarcus Paradies

N