
Geographically-Typed Geospatial Data Source Matching with High-Quality Clustering and Multi-Attribute Matching

Jeffrey Partyka, Dr. Latifur Khan, Dr. Bhavani Thuraisingham

Funded by NGA & US Air Force

Topic Outline

• Problem Statement
• Background Information
• Matching Procedures
  - Generalized Solution
  - N-grams
  - Non-Geographic Matching (NGT Matching)
  - Geographic Matching (GT Matching)
  - Attribute Weighting
  - High-Quality Clustering
  - 1:N Matching
• Experimental Results
• Future Work

Motivation

• Internet Architecture
  ▫ Highly Distributed
  ▫ Federated Architecture

• Web Application Problems
  ▫ Low Performance for Information Retrieval
  ▫ Accuracy of Retrieved Information

Sample Scenario

Rank Data Source

Query: Publication of Academic Staff

MIT Ontology

Karlsruhe Ontology

UMBC Ontology

{Article, Book, Booklet, InBook, InCollection, InProceedings, Manual, Misc, Proceedings, Report, Technical Report, Project Report, Thesis, Master Thesis, PhD Thesis, Unpublished, Faculty Member, Lecturer}

Different Bibliography Ontologies

MIT Ontology

Karlsruhe Ontology

UMBC Ontology

Problem Statement: Schema Matching

Given 2 data sources, S1 and S2, each of which is composed of a set of tables, where S1 = {T11, T12, T13, …, T1k, …, T1m} and S2 = {T21, T22, T23, …, T2j, …, T2n}, with 1 <= k <= m and 1 <= j <= n, determine the similarity between T1k and T2j.

S1:

roadName        City
Johnson Rd.     Plano
School Dr.      Richardson
Zeppelin St.    Lakehurst
Alma Dr.        Richardson

S2:

Road          County
Custer Pwy    Cooke
15th St.      Collin
Parker Rd.    Collin
Alma Dr.      Collin

COUNTY       Destination
SNOHOMISH    Mukilteo
PIERCE       Point Defiance
KITSAP       Southworth
SNOHOMISH    Edmonds

City             County
Anacortes        Skagit
Friday Harbor    San Juan
Argyle           San Juan
Kirkland         King

Road Road

Problem Statement: Ontology Matching

Given 2 ontologies, O1 and O2, each of which is composed of a set of concepts, where O1 = {C11, C12, C13, …, C1k, …, C1m} and O2 = {C21, C22, C23, …, C2j, …, C2n}, with 1 <= k <= m and 1 <= j <= n, determine the similarity between C1k and C2j.

Motivating Scenarios

1. Making Complex Business Decisions

“Should we invest in a new cholesterol drug for the Asia-Pacific region?”

2. Robust Semantic Web Applications

R & D

Corporate

Marketing

Regulatory Affairs

Manufacturing

Yes/No/Maybe?

“Find the group of friends around Jeff. Then find the most important person out of the group. Find out if this person was at an event of type Meeting, and happened between 9AM-11AM within 5 miles of UTD”

Jeff, Jeff’s friends

Within 5 miles of UTD

9:00am-11:00am

Yes/No/Maybe?

Social Network

Geospatial Ontology

Temporal Logic

RDFS Lookup

Event of Type ‘Meeting’

Matching Approaches

Mappings may be generated in several ways. Some approaches are:

(1) Name Matching

(2) Structure Matching

(3) Instance Matching

Email emailAddress

County DSP

Kitsap Kingston

Wahkiak Puget Island

COUNTYNAME CID

TRAIL RANGE DR 96

KITSAP 97

?

Some Definitions

Definition 1 (attribute) An attribute of a table T, denoted as att(T), is defined as a property of T that further describes it.

Definition 2 (instance) An instance x of an attribute att(T) is defined as a data value associated with att(T).

Definition 3 (keyword) A keyword k of an instance x associated with attribute att(T) is defined as a meaningful word (not a stopword) representing a portion of the instance.

Some Definitions (cont)

Definition 4a (geographic type (GT)) A geographic type GT associated with attribute att(T) is defined as a class of instances of att(T) that represent the same geographic feature (e.g. “lake”, “road”).

Definition 4b (non-geographic type (NGT)) A non-geographic type (NGT) associated with attribute att(T) is defined as a group of keywords from instances of att(T) that are semantically related to each other.

Collin

Plano

Richardson

New Jersey

Trenton

Monmouth



Overview of Matching Algorithm

1. Select attribute pairs for comparison

   Attributes of the first table: roadName, roadType, city
   Attributes of the second table: town, rType, rName, county
   Selected pair: roadName, rName

2. Match instances between compared attributes

   roadName: K Ave., Jupiter Rd., Coit Rd.
   rName: L Ave., LBJ Freeway, US 75

   Run Sim algorithms…

3. Determine final attribute similarity

   roadName, rName: EBD = .98

Determining Semantic Similarity

• We use Entropy-Based Distribution (EBD)

• EBD is a measurement of type similarity between 2 attributes (or columns):

  EBD = H(C|T) / H(C)

• EBD takes values in the range [0,1]. A greater EBD corresponds to more similar type distributions between the compared attributes (columns)

Applying EBD to Semantic Matching

att1: X, X, X, Y, Y, Z

att2: X, X, Y, Y, Y, Z

[Figure: the instances of att1 and att2 grouped by type (X, Y, Z), from which the entropy H(C) and the conditional entropy H(C|T) are computed]
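As a rough illustration, the EBD of the att1/att2 example can be computed with a few lines of Python. This is only a sketch under an assumption the slides do not state outright: C is taken to be the attribute (column) an instance comes from, and T its type. The function names `entropy` and `ebd` are made up for this example.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def ebd(att1_types, att2_types):
    """EBD = H(C|T) / H(C) for two columns given as lists of type labels.

    Assumption: C is the column an instance belongs to, T is its type.
    """
    pairs = [("att1", t) for t in att1_types] + [("att2", t) for t in att2_types]
    n = len(pairs)
    h_c = entropy(list(Counter(c for c, _ in pairs).values()))
    if h_c == 0:  # only one column has instances, so the ratio is undefined
        return 0.0
    h_c_given_t = 0.0
    for t, n_t in Counter(t for _, t in pairs).items():
        # distribution of columns among the instances of type t
        col_counts = Counter(c for c, tt in pairs if tt == t)
        h_c_given_t += n_t / n * entropy(list(col_counts.values()))
    return h_c_given_t / h_c

att1 = ["X", "X", "X", "Y", "Y", "Z"]
att2 = ["X", "X", "Y", "Y", "Y", "Z"]
score = ebd(att1, att2)
print(round(score, 3))
```

Under this reading, two columns with identical type distributions score 1, and two columns that share no type score 0, which agrees with the stated [0,1] range.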





Matching Using N-grams

• Use commonly occurring N-grams [2,3] in compared attributes to determine similarity (N = 2)

TA:

StrName            FENAME          Status
LOCUST-GROVE DR    LOCUST GROVE    BUILT
TRAIL RANGE DR     TRAIL RANGE     BUILT

TB:

Street              Laddress    Raddress
LOUISE -DOVER DR    1600        1798
CR45/MANET CT       2500        2598

Some N-grams extracted from A.StrName = {LO, OC, CU, ST, OV, …}
Some N-grams extracted from B.Street = {LO, OU, UI, OV, …}

[Figure: N-grams grouped by type; LO and OV occur in both attributes, ST and UI in only one. The groups are used to compute the conditional entropy H(C|T)]
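The N-gram extraction step can be sketched as follows. This is a minimal example, not the authors' implementation: it takes character 2-grams over the whole instance string (including spaces and punctuation), which is an assumption, and the helper names are hypothetical.

```python
def ngrams(value, n=2):
    """All contiguous character n-grams of a string."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]

def column_ngrams(instances, n=2):
    """Set of n-grams occurring anywhere in a column's instances."""
    return {g for value in instances for g in ngrams(value, n)}

str_name = ["LOCUST-GROVE DR", "TRAIL RANGE DR"]   # instances of A.StrName
street = ["LOUISE -DOVER DR", "CR45/MANET CT"]     # instances of B.Street

a_grams = column_ngrams(str_name)
b_grams = column_ngrams(street)
shared = a_grams & b_grams   # n-gram "types" present in both columns

print(sorted(shared))
```

Each distinct N-gram then plays the role of a type: N-grams found in both columns raise H(C|T), while N-grams found in only one column contribute nothing to it.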

[2] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Content-based ontology matching for GIS datasets. ACM SIGSPATIAL GIS 2008 (ACM GIS, Laguna Beach, California, Nov. 2008): 51.

[3] Jeffrey Partyka, Neda Alipanah, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Ontology Alignment Using Multiple Contexts. 7th International Semantic Web Conference (ISWC) Karlsruhe, Germany, Oct. 2008.

Faults of this Method

• Semantically similar columns are not guaranteed to have a high similarity score

T1 (attribute A = City):

City           Country
Dallas         USA
Houston        USA
Kingston       Jamaica
Halifax        Canada
Mexico City    Mexico

T2 (attribute B = ctyName):

ctyName         country
Shanghai        China
Beijing         China
Tokyo           Japan
New Delhi       India
Kuala Lumpur    Malaysia

2-grams extracted from A: {Da, al, la, as, Ho, ou, us, …}

2-grams extracted from B: {Sh, ha, an, ng, gh, ha, ai, Be, ei, ij, …}





Non-Geographic Matching

[Figure: keywords grouped into semantic clusters: Dallas, Houston, Tokyo, Beijing, Halifax, New Delhi and USA, China, Jamaica, India, Malaysia]

● Use clustering methods to group keywords of instances together without relying on shared N-grams between instances[4]

● K-means is not suitable because we cannot compute a centroid among string instances, so we use K-medoid clustering

● Use Normalized Google Distance (NGD) as a distance measure between any two keywords in a cluster

● WordNet would not be a suitable distance measure in the GIS domain
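The clustering step described in the bullets above can be sketched as a plain K-medoid loop that only needs a pairwise distance function, so no centroid over strings is required. This is an illustrative sketch: `toy_distance` is a made-up stand-in for NGD, the initial medoids are simply the first k keywords, and distinct keywords are assumed to have a positive distance.

```python
def k_medoids(items, k, dist, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing.

    The initial medoids are the first k items (an arbitrary choice).
    """
    medoids = list(items[:k])
    clusters = {}
    for _ in range(max_iter):
        # assignment step: each item joins its nearest medoid
        clusters = {m: [] for m in medoids}
        for x in items:
            nearest = min(medoids, key=lambda m: dist(x, m))
            clusters[nearest].append(x)
        # update step: the new medoid minimizes total distance within its cluster
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, y) for y in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

CITIES = {"Dallas", "Houston", "Tokyo"}

def toy_distance(a, b):
    """Hypothetical distance: small within cities or within countries, large across."""
    if a == b:
        return 0.0
    return 0.2 if (a in CITIES) == (b in CITIES) else 0.9

keywords = ["Dallas", "USA", "Houston", "China", "Tokyo", "Japan"]
clusters = k_medoids(keywords, 2, toy_distance)
for medoid, members in clusters.items():
    print(medoid, members)
```

The result of such a loop depends on the initial medoids, so a different starting choice can yield a different grouping.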

[4] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Semantic Schema Matching without Shared Instances. 3rd IEEE International Conference on Semantic Computing (ICSC) Berkeley, California, September 2009: 297-302.

Definition of Google Distance

NGD(x, y)[7] is a measure for the symmetric conditional probability of co-occurrence of x and y

[7] Cilibrasi,R.,Vitányi, P.: The Google Similarity Distance. IEEE Trans. Knowledge and Data Engineering 19, 370--383 (2007)
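For reference, the NGD of Cilibrasi and Vitányi [7] is computed from page counts: f(x) and f(y) are the numbers of pages containing each term, f(x, y) the number containing both, and N the total number of indexed pages. The sketch below implements that formula; the counts in the usage lines are invented, and how the counts are obtained from a search engine is left out.

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                / (log N - min(log f(x), log f(y)))
    """
    if fxy == 0:
        return float("inf")   # the terms never co-occur
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Hypothetical counts, for illustration only.
always_together = ngd(1000, 1000, 1000, 10**6)
rarely_together = ngd(1000, 1000, 10, 10**6)
print(always_together, round(rarely_together, 3))
```

Terms that always occur together get a distance of 0, and the distance grows as co-occurrence becomes rarer.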

T1 ∈ O1:

roadName       City
Johnson Rd.    Plano

T2 ∈ O2:

Road          County
Custer Pwy    Collin
15th St.      Collin
Parker Rd.    Collin

Step 1: Extract distinct keywords from compared attributes

Keywords extracted from attributes = {Johnson, Rd., School, 15th, …}

Step 2: Group distinct keywords together into semantic clusters (K-medoid + NGD instance similarity)

“Rd.”, “Dr.”, “St.”, “Pwy”, …
“Johnson”, “School”, “Dr.”, …

Step 3: Calculate Similarity

Similarity = H(C|T) / H(C)

Problems with Non-Geographic Matching via NGD + K-medoid

It is possible that two different geographic entities (e.g. Dallas, TX and Dallas County) in the same location will be mistaken for being similar:

roadName       City
Johnson Rd.    Plano
Alma Dr.       Richardson
Preston Rd.    Addison
Dallas Pkwy    Dallas

Road                 County
Custer Pwy           Cooke
15th St.             Collin
Parker Rd.           Collin
Alma Dr.             Collin
Campbell Rd.         Denton
Harry Hines Blvd.    Dallas





Geographic Type Matching

We use a gazetteer to determine the geographic type (GT) of an instance [5,6]:

[Figure: instances of S1 (Anacortes, Edmonds) and instances of S2 linked to their GTs; instances such as Victoria and Clinton are marked with "?"]

[5] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geographically-Typed Semantic Schema Matching. In: Divyakant, A., Aref, W., Lu, C.T. et al. (eds.) ACM SIGSPATIAL GIS 2009, Seattle, Washington, pp. 456--459. ACM (Nov. 2009)

[6] Jeffrey Partyka, Latifur Khan, Bhavani M. Thuraisingham: Geospatial Schema Matching with High-Quality Clustering and Multi-Attribute Matching. Submitted to the 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011, May 2011, Shenzhen, China).

Using Latlong Value to Enhance GT Matching

GSim: Combining NGT and GT Matching

We apply GT matching for an attribute comparison if >= 50% of the instances involved in the comparison have GT information. If this is not the case, then NGT matching is applied instead [1]:

featureName        City
Collin Creek       Plano
White Rock Lake    Dallas
Dallas River       Lakehurst

Lake                County
Cooke Lake          Cooke
Mud Lake            Collin
Stone Briar Lake    Collin

>= 50% of instances have a GT?

No: NGT Matching, with keyword clusters {Lake, Creek, River} and {Rock, Stone, Mud}

Yes: GT Matching, with instance groups {Cooke Lake, Mud Lake, Stone Briar Lake} and {Collin Creek}
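The 50% rule can be written down directly. The sketch below is only an illustration: the gazetteer is modeled as a small in-memory dictionary with invented entries, whereas a real gazetteer lookup would be a service or database query, and `choose_matcher` is a hypothetical name.

```python
def choose_matcher(instances, gazetteer, threshold=0.5):
    """Return "GT" if at least `threshold` of the instances have a GT, else "NGT"."""
    if not instances:
        return "NGT"
    typed = sum(1 for x in instances if x in gazetteer)
    return "GT" if typed / len(instances) >= threshold else "NGT"

# Hypothetical gazetteer entries (instance name -> geographic type).
toy_gazetteer = {
    "White Rock Lake": "lake",
    "Cooke Lake": "lake",
    "Mud Lake": "lake",
    "Collin Creek": "creek",
}

# Instances of the two compared attributes, featureName and Lake.
compared = ["Collin Creek", "White Rock Lake", "Dallas River",
            "Cooke Lake", "Mud Lake", "Stone Briar Lake"]

print(choose_matcher(compared, toy_gazetteer))   # 4 of 6 instances have a GT
print(choose_matcher(compared, {}))              # no GT information available
```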

[1] Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar: Enhanced Geographically-Typed Semantic Schema Matching. To appear in the Journal of Web Semantics, 2011.





Attribute Weighting

• We can distribute the weight of each attribute match based on its importance:

strAdd                   city          state    zipCode
1000 Park Blvd.          Plano         TX       75075
209 Spring Valley Rd.    Richardson    TX       75080
1703 Danube Ln.          Plano         TX       75075
18431 Roehampton Dr.     Dallas        TX       75252

Street Address           City      State    Zip
100 Genstar Dr.          Dallas    TX       75252
2091 Spring Creek Rd.    Plano     TX       75075
1704 Danube Ln.          Plano     TX       75075
                         Dallas    TX       75252

Attribute match weights shown: 27%, 23%, 26%, 24%

Measuring Attribute Match Importance

• Attribute Match Importance determined by:

  1. Attribute Uniqueness
  2. Attribute Relevance

[Figure: attributes of the compared tables (Roads, Ports, Lakes in one source; Roads, Sea Ports, LakeFeatures in the other), including name, roadType, city, town, road_type, rName, ctyName, lakeType, type, lakename, destPort, Dest, county and edez_id]

Attribute Uniqueness

• Determine uniqueness of attributes att1 and att2 involved in a match (att1-att2) by clustering all attributes from all tables over S1 and S2:

[Figure: clustering of attributes, with cutoff 1 and cutoff 2 marked]

Attribute Clustering

• Use Intercluster Similarity (ICS) to decide if clusters A and B should merge:

• Calculate cutoff point (CP) to determine when to stop clustering:

[Figure: Cutoff Point vs. # of Cluster Iterations]

Calculating AU, Corrected EBD Value

• Calculate AU for an attribute att in a match, where AUatt ϵ [0,1]:

• Calculate pairwise uniqueness (PU) for a match att1-att2:

  PUatt1,att2 = avg(AUatt1(T), AUatt2(T’))

• Recalculate EBD between att1(T)-att2(T’):

  EBDcorr(att1, att2) = EBDorig(att1, att2) x PUatt1,att2

[Figure: attribute clusters over rName, name (Roads), name (Ports), lakename, name (Lakes), Name (Sea Ports), destPort and Dest]

Att Match                          PUatt1-att2    EBDorig    EBDcorr
Name (Ports) - Name (Sea Ports)    .688           .90        .619
destPort - Dest                    .938           .80        .750
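The correction step is simple arithmetic, and the two table rows can be reproduced as a check. In the sketch below the PU and EBDorig values come from the table; the AU values passed to `pairwise_uniqueness` are invented, since the slides do not list the individual AU scores.

```python
def pairwise_uniqueness(au1, au2):
    """PU of a match: the average of the two attribute uniqueness (AU) values."""
    return (au1 + au2) / 2

def corrected_ebd(ebd_orig, pu):
    """EBDcorr = EBDorig x PU."""
    return ebd_orig * pu

# (PU, EBDorig) pairs taken from the table.
rows = {
    "Name (Ports) - Name (Sea Ports)": (0.688, 0.90),
    "destPort - Dest": (0.938, 0.80),
}
corrected = {match: round(corrected_ebd(e, pu), 3) for match, (pu, e) in rows.items()}
print(corrected)

# Hypothetical AU values, only to show how PU is formed.
print(pairwise_uniqueness(0.5, 1.0))
```

A match between two attributes that sit in crowded clusters (low AU) is therefore penalized more than a match between two unusual attributes, even when its original EBD is higher.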

Attribute Weighting Algorithm





High-Quality Clustering

• Due to the inherent randomness of clustering (e.g. choosing the initial centroid), EBD scores may not be stable [6]

• We need a way to produce consistent EBD values
  - To eliminate EBD variability
  - To provide a confidence value for our EBD value
  - To guarantee that our EBD value was generated from a high-quality clustering

• We proposed the following two cluster-based measures
  (1) Semantic Purity: the “meaning distance” between any two instances within the same cluster
  (2) Geographic Purity: the GT purity of a given cluster

Cluster Purity Measures

Distance-based Measure:

ImpS =

Geographic-Type Measure:

Objective Function to be Minimized:

OSSKM = where Wi =

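[Figure: a clustering with mixed types, {Collin, Tarrant, Plano} and {Kaufman, Coppell, Richardson}, next to a clustering with pure types, {Collin, Tarrant, Kaufman} and {Plano, Coppell, Richardson}]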





1:N Matching

Many relationships are not 1:1, but involve matching groups of entities

id    Mailing Address
1     12 Plano Dr., Plano, TX, 75075
2     18 Coit Rd., Richardson, TX, 75080
3     200 Preston Rd., Dallas, TX
4     2 Hedgecoxe Rd.

Street Address     City      State    Zip
100 Genstar Dr.    Dallas    TX       75252
                   Plano     TX       75075
1704 Danube Ln.    Plano     TX       75075
                   Dallas    TX       75252

Cmp = Mailing Address; N1 to N4 = Street Address, City, State, Zip

Defining 1:N Matching

• 1:N matching can be defined in many ways
  - Optimize similarity or value of N?
  - Meronymy or Subsumption?

• We chose to optimize similarity (EBD)
  - Use EBD scores produced from 1-1 matches between Cmp and Nk (1 <= k <= N)
  - Apply greedy algorithm to add attributes to match with Cmp based on decreasing EBD score (highest to lowest)
  - Any 1:N match will minimize the set difference between GT(Cmp) and the union of the sets of GTs for the N matching attributes.
  - We do not include an attribute in a 1:N match if it would make the EBD of the current match decrease

1:N Matching Example

id    Mailing Address
1     12 Plano Dr., Plano, TX, 75075
2     18 Coit Rd., Richardson, TX, 75080
4     2 Hedgecoxe Rd.

Street Address     City      State    Zip      County
100 Genstar Dr.    Dallas    TX       75252    Dallas
                   Plano     TX       75075    Collin
1704 Danube Ln.    Plano     TX       75075    Collin

Cmp = Mailing Address; N1 to N4 = Street Address, City, State, Zip

Attribute         1-1 EBD w/ Cmp    1:N EBD
Street Address    .81               .81
City              .79               .88
State             .72               .92
Zip               .66               .95

Final 1:N EBD: .95

1:N Matching From Type Perspective

[Figure: Mailing Address instances typed W X Y Z, W X Y Z and W X Y; Street Address, City, State and Zip rows typed W X Y Z; instances grouped into clusters of types W, X, Y and Z, with Entropy = H(C) and Conditional Entropy = H(C|T)]

Greedy 1:N Matching Algorithm

program 1:N_Matching (S(T2), Sebd(T2)) {
    var E(T2) = Φ;
    S(T2) = Φ;
    Sebd(T2) = 0.0;
    GTCmp = getGTSet(Cmp);
    E(T2) = getMatchCandidates(Cmp, T2, GTCmp);
    E(T2) = orderByEBD(E(T2));

    for att A ϵ E(T2) with max value of EBD(Cmp, A) {
        if (increaseEBD(Cmp, A, Sebd(T2))) {
            Emax = A;
            S(T2) = S(T2) U {Emax};
            Sebd(T2) = addEBD(Sebd(T2), EBD(Cmp, Emax));
        }
        E(T2) = E(T2) – {A};
    }
}
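A runnable sketch of the same greedy loop is given below. It is an interpretation rather than the authors' code: the slides do not say how the EBD of a 1:N group is computed, so `group_ebd` is a caller-supplied function, and here it is a lookup table filled with the 1:N EBD values from the example slide. The `County` candidate and its scores are invented to show a rejected attribute.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

def greedy_one_to_n(
    pair_ebd: Dict[str, float],
    group_ebd: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedily build the set of attributes matched to Cmp.

    pair_ebd: 1:1 EBD of each candidate attribute with Cmp.
    group_ebd: EBD between Cmp and a whole set of attributes (assumed given).
    """
    selected: List[str] = []
    best = 0.0
    # visit candidates from the highest to the lowest 1:1 EBD
    for att in sorted(pair_ebd, key=pair_ebd.get, reverse=True):
        score = group_ebd(frozenset(selected + [att]))
        if score > best:          # keep the attribute only if the EBD increases
            selected.append(att)
            best = score
    return selected, best

pair = {"Street Address": 0.81, "City": 0.79, "State": 0.72, "Zip": 0.66,
        "County": 0.30}           # County score is hypothetical

slide_scores = {
    frozenset({"Street Address"}): 0.81,
    frozenset({"Street Address", "City"}): 0.88,
    frozenset({"Street Address", "City", "State"}): 0.92,
    frozenset({"Street Address", "City", "State", "Zip"}): 0.95,
    # hypothetical: adding County lowers the group EBD
    frozenset({"Street Address", "City", "State", "Zip", "County"}): 0.90,
}

selected, final_ebd = greedy_one_to_n(pair, lambda s: slide_scores[s])
print(selected, final_ebd)
```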

Proof of Correctness

Theorem 1 (proof of greedy choice property for the 1:N matching algorithm): All choices for Emaxx(T2) will be present in an optimal 1:N match with Cmp ϵ T1.

Suppose that SebdN(T2), for an arbitrary SN(T2), produces an optimal EBD. Let us build a new set called S2ebdN(T2) from S2N(T2) such that every attribute included in S2N(T2) represented a value of Emaxx (T2) for some x. Also, the cardinality of SN(T2) and S2N(T2) are equal, and every attribute between SN(T2) and S2N(T2) is identical, except for an arbitrary attribute indexed by r (r <= N) in S2N(T2). Then by the definition of Emaxx for all x in Ex(T2), the EBD value produced between Cmp and attribute r in S2N(T2) is >= the EBD value produced between Cmp and attribute r given in SN(T2) . Since all other attributes are equal between SN(T2) and S2N(T2), then their associated 1:1 EBD scores with Cmp are also identical. Therefore, EBD(Cmp, S2N(T2)) >= EBD (Cmp, SN(T2)), but since SN(T2) produces an optimal EBD with Cmp through SebdN(T2), then EBD(Cmp, S2N(T2)) = EBD (Cmp, SN(T2)). Thus, S2N(T2) also produces an optimal EBD with Cmp through S2ebdN(T2).

Proof of Correctness (cont)

Theorem 2 (proof of optimal substructure property):

Let SebdN-1(T2), N > 1, be the EBD score corresponding to the attribute match between Cmp ϵ T1 and SN-1(T2) ϵ T2. If SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2) , then SebdN(T2) must also be an optimal EBD score.

Assume that SN(T2) was formed by adding Emaxx to SN-1(T2), but does not produce an optimal value of SebdN(T2). Emaxx represents the attribute with the highest EBD score with Cmp to be included in SN-1(T2) with respect to all other attributes in Ex(T2). Then this means that SN-1(T2) contains some attribute indexed by r (r <= N-1) whose EBD value is less than that of Emaxr. Thus, SebdN-1(T2) is not an optimal EBD score. This contradicts the statement above that SebdN-1(T2) is an optimal EBD score. Therefore, if SebdN-1(T2) is an optimal EBD score, and SebdN(T2) is obtained by adding Emaxx to SN-1(T2), then SebdN(T2) must be an optimal EBD score.

Theorem 3: Greedy 1:N matching produces a safe match with an optimal EBD score. This follows from Theorem 1 and Theorem 2.

Dataset Details

GIS Transportation Dataset (GTD)

GIS Location Dataset (GLD)

Dataset Details (cont)

GIS Point of Interest Dataset (GPD)

- Through all of our datasets, few shared instances exist

- Data is multijurisdictional in nature

- Number of attributes and instances differ

NGT Matching Over GTD

GT Matching Over GTD

The Effect of Latlong Values on Matching in GPD

The Effect of Attribute Weighting on Matching in GTD and GLD

Observing the Effects of Multiple Matching Methods over GPD

(1) GT matching
(2) GT matching + latlong
(3) GT matching + latlong + NGT matching
(4) GT matching + latlong + NGT matching + attribute weighting

GSim vs. N-grams, SVD, NMF & GSimG

1:N Matching Experiment Results

Experiment 1

T1 = {‘Address’}
T2 = {‘Street Address’, ‘City’, ‘State’, ‘Zip’}

1:N Matching Experiment Results (cont)

Experiment 2

T1 = {‘Island_Group’}
T2 = {‘Island1’, ‘Island2’, ‘Island3’, ‘Island4’, ‘Island5’, ‘Island6’}

‘Island6’ is not a part of ‘Island_Group’

Summary of Matching Methods

Exact Match

Synonym Match

GT Match

GT + Latlong Match

Hierarchical GT Match

N-grams

NGT Matching

GT Matching

GT + Latlong

GT + Cluster Purity

Final GSim (Ideal)

Hierarchical GT Matching

• Use a GT hierarchy to match types with relationships between them (e.g. superclass/subclass, meronym/holonym, etc.)

GT hierarchy: Bodies of Water, Lakes, Rivers, Rapids, Streams

att1: Dell Lake, Dallas River, Coppell Stream

att2: HP Lake, Collin River, Plano Rapids

Get GT Relations From Ontology:

Dell Lake, HP Lake
Dallas River, Coppell Stream, Collin River, Plano Rapids

Calculate Similarity
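One way to picture the grouping step is to lift each instance's GT to a chosen level of the hierarchy before comparing. The sketch below assumes that Streams and Rapids are subtypes of Rivers, which the grouping of instances suggests but the slide does not state; the `PARENT` table and the `generalize` helper are hypothetical.

```python
# Assumed child -> parent relations of the GT hierarchy.
PARENT = {
    "Lakes": "Bodies of Water",
    "Rivers": "Bodies of Water",
    "Streams": "Rivers",
    "Rapids": "Rivers",
}

def generalize(gt, level_types):
    """Walk up the hierarchy until a type in level_types (or the root) is reached."""
    while gt not in level_types and gt in PARENT:
        gt = PARENT[gt]
    return gt

# GT of each instance, as a gazetteer lookup might return it.
instance_gt = {
    "Dell Lake": "Lakes", "Dallas River": "Rivers", "Coppell Stream": "Streams",
    "HP Lake": "Lakes", "Collin River": "Rivers", "Plano Rapids": "Rapids",
}

groups = {}
for name, gt in instance_gt.items():
    groups.setdefault(generalize(gt, {"Lakes", "Rivers"}), []).append(name)

print(groups)
```

With the types lifted to the Lakes/Rivers level, Coppell Stream and Plano Rapids fall into the same group as the rivers, so the two attributes share both of their types.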

THANK YOU!

ANY QUESTIONS?
