link mining


Page 1: Link Mining

08/22/2004 MRDM 2004 Workshop 1

Link Mining

Lise Getoor
University of Maryland, College Park

joint work with Indrajit Bhattacharya, Qing Lu and Prithviraj Sen

Page 2: Link Mining

Roadmap

• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group Detection
• Conclusion

Page 3: Link Mining

Link Mining

• Traditional machine learning and data mining approaches assume:
  – A random sample of homogeneous objects from a single relation
• Real-world data sets:
  – Multi-relational, heterogeneous and semi-structured
• Link Mining
  – A newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

Page 4: Link Mining

Linked Data

• Heterogeneous, multi-relational data represented as a graph or network
  – Nodes are objects
    • May have different kinds of objects
    • Objects have attributes
    • Objects may have labels or classes
  – Edges are links
    • May have different kinds of links
    • Links may have attributes
    • Links may be directed, are not required to be binary
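One way to hold this kind of data in code is sketched below. It is only an illustration of the slide's vocabulary (typed objects with attributes and optional labels, typed links that may be directed and may join more than two objects); the class names `Node` and `Link` and all example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Node:
    """An object: it has a kind, attributes and an optional class label."""
    node_id: str
    kind: str                                    # e.g. "paper" or "author"
    attributes: Dict[str, Any] = field(default_factory=dict)
    label: Optional[str] = None                  # None when the label is unknown


@dataclass
class Link:
    """A link of some kind over two or more objects; it may carry attributes."""
    kind: str                                    # e.g. "cites" or "co-author"
    members: Tuple[str, ...]                     # node ids; more than two means a non-binary link
    directed: bool = False
    attributes: Dict[str, Any] = field(default_factory=dict)


# Hypothetical toy graph: two papers and three authors.
nodes = {
    "p1": Node("p1", "paper", {"words": ["link", "mining"]}, label="ML"),
    "p2": Node("p2", "paper", {"words": ["graph", "mining"]}),
    "a1": Node("a1", "author", {"name": "Author One"}),
    "a2": Node("a2", "author", {"name": "Author Two"}),
    "a3": Node("a3", "author", {"name": "Author Three"}),
}
links = [
    Link("cites", ("p2", "p1"), directed=True),          # binary, directed
    Link("co-author", ("a1", "a2", "a3")),                # one link over three objects
]
```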

Page 5: Link Mining

Sample Domains

• web data (web)
• bibliographic data (cite)
• epidemiological data (epi)
• communication data (comm)
• customer networks (cust)
• collaborative filtering problems (cf)
• trust networks (trust)
• biological data (bio)

Page 6: Link Mining

Link Mining Tasks

• Link-based Object Classification
• Object Type Prediction
• Link Type Prediction
• Predicting Link Existence
• Link Cardinality Estimation
• Object Consolidation
• Group Detection
• Subgraph Discovery
• Metadata Mining

Page 7: Link Mining

Link-based Object Classification

• Predicting the category of an object based on its attributes and its links and attributes of linked objects

• web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.

• cite: Predict the topic of a paper, based on word occurrence, citations, co-citations

• epi: Predict disease type based on characteristics of the patients infected by the disease

Page 8: Link Mining

Object Class Prediction

• Predicting the type of an object based on its attributes and its links and attributes of linked objects

• comm: Predict whether a communication contact is by email, phone call or mail.

• cite: Predict the venue type of a publication (conference, journal, workshop)

Page 9: Link Mining

Link Type Classification

• Predicting type or purpose of link based on properties of the participating objects

• web: predict advertising link or navigational link; predict an advisor-advisee relationship

• epi: predicting whether contact is familial, co-worker or acquaintance

Page 10: Link Mining

Predicting Link Existence

• Predicting whether a link exists between two objects
• web: predict whether there will be a link between two pages
• cite: predicting whether a paper will cite another paper
• epi: predicting who a patient’s contacts are

Page 11: Link Mining

Link Cardinality Estimation I

• Predicting the number of links to an object

• web: predict the authoritativeness of a page based on the number of in-links; identifying hubs based on the number of out-links

• cite: predicting the impact of a paper based on the number of citations

• epi: predicting the number of people that will be infected based on the infectiousness of a disease.
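The web example above rests on counting in-links and out-links. A minimal sketch of that counting step follows; the edge list is hypothetical, and raw degree is used only as the crude proxy the slide describes, not as a full authority or hub score.

```python
from collections import Counter

# Hypothetical directed links between pages: (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("d", "a")]

in_links = Counter(target for _, target in edges)     # number of links into each page
out_links = Counter(source for source, _ in edges)    # number of links out of each page

# Page with the most in-links (proxy for authoritativeness).
top_authority, top_in = in_links.most_common(1)[0]

# Pages with the most out-links (proxy for hubs); ties are all reported.
max_out = max(out_links.values())
hubs = sorted(page for page, n in out_links.items() if n == max_out)
```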

Page 12: Link Mining

Link Cardinality Estimation II

• Predicting the number of objects reached along a path from an object
• Important for estimating the number of objects that will be returned by a query

• web: predicting number of pages retrieved by crawling a site

• cite: predicting the number of citations of a particular author in a specific journal

Page 13: Link Mining

Object Consolidation

• Predicting when two objects are the same, based on their attributes and their links
• aka: record linkage, duplicate elimination, identity uncertainty

• web: predict when two sites are mirrors of each other.
• cite: predicting when two citations are referring to the same paper.
• epi: predicting when two disease strains are the same
• bio: learning when two names refer to the same protein

Page 14: Link Mining

Group Detection

• Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure

• web – identifying communities
• cite – identifying research communities

Page 15: Link Mining

Subgraph Identification

• Find characteristic subgraphs
• Focus of graph-based data mining (Cook & Holder; Inokuchi, Washio & Motoda; Kuramochi & Karypis; Yan & Han)

• bio – protein structure discovery
• comm – legitimate vs. illegitimate groups
• chem – chemical substructure discovery

Page 16: Link Mining

Metadata Mining

• Schema mapping, schema discovery, schema reformulation

• cite – matching between two bibliographic sources

• web - discovering schema from unstructured or semi-structured data

• bio – mapping between two medical ontologies

Page 17: Link Mining

Link Mining Tasks

• Link-based Object Classification
• Object Type Prediction
• Link Type Prediction
• Predicting Link Existence
• Link Cardinality Estimation
• Object Consolidation
• Group Detection
• Subgraph Discovery
• Metadata Mining

Page 18: Link Mining

Link Mining Challenges

• Logical vs. Statistical dependencies
• Feature construction
• Instances vs. Classes
• Collective Classification
• Collective Consolidation
• Effective Use of Labeled & Unlabeled Data
• Link Prediction
• Closed vs. Open World

Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming to name a few)

Page 19: Link Mining

Logical vs. Statistical Dependence

• Coherently handling two types of dependence structures:
  – Link structure: the logical relationships between objects
  – Probabilistic dependence: statistical relationships between attributes
• Challenge: statistical models that support rich logical relationships
• Model search is complicated by the fact that attributes can depend on arbitrarily linked attributes. Issue: how to search this huge space

Page 20: Link Mining

Model Search

[Figure: diagram with nodes P1, P2, P3, A1 and I1; the dependency for node P is marked "?"]

Page 21: Link Mining

Feature Construction

• In many cases, objects are linked to a set of objects. To construct a single feature from this set of objects, we may either use:
  – Aggregation
  – Selection
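The two options can be contrasted in a few lines. The sketch below is an assumption-laden illustration: aggregation is shown as the mode over the linked objects' categories (the aggregate named on the following slide), and selection is shown as picking one linked object by id, which is a hypothetical selection criterion. The data and function names are made up.

```python
from collections import Counter

# Hypothetical categories of the papers linked to one object.
linked = [("P1", "theory"), ("P2", "systems"), ("P3", "theory")]


def aggregate_mode(linked_objects):
    """Aggregation: summarise the whole set by its most frequent category."""
    counts = Counter(category for _, category in linked_objects)
    return counts.most_common(1)[0][0]


def select(linked_objects, chosen_id):
    """Selection: use the category of one chosen linked object as the feature."""
    for object_id, category in linked_objects:
        if object_id == chosen_id:
            return category
    return None  # the chosen object is not among the linked objects


feature_by_aggregation = aggregate_mode(linked)
feature_by_selection = select(linked, "P2")
```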

Page 22: Link Mining

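Aggregation

[Figure: two examples of aggregation. In each, the mode over a set of linked P nodes is taken as a single feature for the node P marked "?". Node labels in the first example: I1, A1, P1–P3; in the second: I2, A2, P4–P9.]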

Page 23: Link Mining

Selection

[Figure: example of selection, with nodes I1, A1, P1, P2 and P3 and the node P marked "?"]

Page 24: Link Mining

Individuals vs. Classes

• Does the model refer
  – explicitly to individuals, or
  – to classes or generic categories of individuals?

• On one hand, we’d like to be able to model that a connection to a particular individual may be highly predictive

• On the other hand, we’d like our models to generalize to new situations, with different individuals

Page 25: Link Mining

Instance-based Dependencies

[Figure: diagram with nodes A1, P3 and I1, captioned "Papers that cite P3 are likely to be ..."; the category is shown graphically in the slide]

Page 26: Link Mining

Class-based Dependencies

[Figure: diagram with nodes A1 and I1 and an unknown node "?", captioned "Papers that cite ... are likely to be ..."; both categories are shown graphically in the slide]

Page 27: Link Mining

Collective Classification

• Using a link-based statistical model for classification

• Inference using learned model is complicated by the fact that there is correlation between the object labels

Page 28: Link Mining

Collective Consolidation

• Using a link-based statistical model for object consolidation

• Consolidation decisions should not be made independently

Page 29: Link Mining

Labeled & Unlabeled Data

• In link-based domains, unlabeled data provide three sources of information:
  – They help us infer the object attribute distribution
  – Links between unlabeled data allow us to make use of attributes of linked objects
  – Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences

Page 30: Link Mining

Link Prior Probability

• The prior probability of any particular link is typically extraordinarily low

• For medium-sized data sets, we have had success with building explicit models of link existence

• It may be more effective to model links at a higher level; this is required for large data sets!
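A quick calculation shows how low the prior is. The sketch uses the CoraI counts from the Data Sets slide of this talk (3181 papers, 6185 citations) and assumes that every ordered pair of distinct papers is a potential citation link, which is a simplification made here for illustration.

```python
papers = 3181        # CoraI, from the Data Sets slide
citations = 6185     # observed citation links in CoraI

# Assumption: any ordered pair of distinct papers could be a citation link.
possible_links = papers * (papers - 1)

# Fraction of potential links that actually exist.
prior = citations / possible_links
```

Under that assumption roughly six in ten thousand potential links exist, so a model that predicts "no link" everywhere is almost always right, which is why explicit link-existence models become hard to scale.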

Page 31: Link Mining

Closed World vs. Open World

• The majority of SRL approaches make a closed world assumption, which assumes that we know all the potential entities in the domain
• In many cases, this is unrealistic
• Work by Milch, Marthi and Russell on BLOG

Page 32: Link Mining

Link Mining Summary

• Link Mining Tasks
  – Link-based Object Classification
  – Object Type Prediction
  – Link Type Prediction
  – Predicting Link Existence
  – Link Cardinality Estimation
  – Object Consolidation
  – Group Detection
  – Subgraph Discovery
  – Metadata Mining
• Link Mining Challenges
  – Logical vs. Statistical dependencies
  – Feature construction
  – Instances vs. Classes
  – Collective Classification
  – Collective Consolidation
  – Effective Use of Labeled & Unlabeled Data
  – Link Prediction
  – Closed vs. Open World

Page 33: Link Mining

Roadmap

• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • work with Qing Lu and Prithviraj Sen
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group Detection
• Conclusion

Page 34: Link Mining

Object Classification

• Traditional Object Classification
  – Assume objects sampled from a single relation
  – Object Attributes (OA)

[Figure: objects X1–X5]

Page 35: Link Mining

Object Classification with Linked Data

• Traditional Object Classification
  – Assume objects sampled from a single relation
  – Object Attributes (OA)
• Linked Data
  – Links among objects
  – Represented as a graph

[Figure: objects X1–X5]

Page 36: Link Mining

Link-based Object Classification

• Predicting the category of an object based on its attributes and its links and attributes of linked objects

• Citation domain: Predict the topic of a paper, based on word occurrence, citations and co-citations

Page 37: Link Mining

Related Work: Link-based Classification

• Hypertext Classification using Links
  – Class labels of linked objects
    • Soumen Chakrabarti (1998)
    • Oh, et al. (1999)
  – Unique document ID, Popescul et al. (2002)
  – Regularities, Yang et al. (2002)
• Use of Unlabeled Data
  – Co-training, Blum and Mitchell (1998)
  – EM algorithm, Nigam et al. (2000)
  – Systematic investigation of EM and co-training, Ghani (2001)
  – TSVM, Joachims (1999)

Page 38: Link Mining

Our Approach

• Link-based models
  – Integrate link features with object attributes using logistic regression
  – Investigate use of labeled and unlabeled data for link-based classification

Page 39: Link Mining

Features

• Object Attributes
  – Notation: OA(X)
• Link Descriptions
  – Notation: LD(X)
  – Statistics computed from linked objects
  – Computed separately for each of:
    • In-Links(X)
    • Out-Links(X)
    • Co-In(X)
    • Co-Out(X)
  – Three types of Link Descriptions:
    • Mode, Binary, Count
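The three link descriptions can be computed from the categories of the linked objects. The sketch below assumes a fixed, ordered category set (here the hypothetical `"A"`, `"B"`, `"C"`) and would be applied once per link set, for example to In-Links(X); the function name is illustrative.

```python
from collections import Counter

CATEGORIES = ["A", "B", "C"]   # hypothetical category set, in a fixed order


def link_descriptions(neighbor_categories):
    """Return the mode, binary and count descriptions of one set of linked objects."""
    counts = Counter(neighbor_categories)
    count_vec = tuple(counts[c] for c in CATEGORIES)       # how many neighbors per category
    binary_vec = tuple(int(n > 0) for n in count_vec)      # whether each category occurs at all
    # Most frequent category; ties go to the earlier category in CATEGORIES.
    mode = max(CATEGORIES, key=lambda c: counts[c]) if neighbor_categories else None
    return {"mode": mode, "binary": binary_vec, "count": count_vec}


# Hypothetical categories of the objects that link to X.
in_links = ["A", "A", "A", "B", "C"]
ld_in = link_descriptions(in_links)
```

Count keeps the most information, binary drops the frequencies, and mode keeps only a single category.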

Page 40: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories]

Page 41: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. The mode is shown for In-Links(X)]

Page 42: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. The mode is shown for In-Links(X) and Out-Links(X)]

Page 43: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. The mode is shown for In-Links(X), Out-Links(X) and CO(X)]

Page 44: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. The mode is shown for In-Links(X), Out-Links(X), CO(X) and CI(X)]

Page 45: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. Each of In-Links(X), Out-Links(X), CO(X) and CI(X) is annotated with its mode and a binary vector; the binary vectors shown are (1,1,1), (1,1,0), (1,1,0) and (1,0,0)]

Page 46: Link Mining

Link Descriptions

[Figure: object X with its linked objects; legend: Categories. Each of In-Links(X), Out-Links(X), CO(X) and CI(X) is annotated with its mode, a binary vector and a count vector; the binary vectors shown are (1,1,1), (1,1,0), (1,1,0) and (1,0,0), and the count vectors shown are (1,2,0), (2,1,0), (2,0,0) and (3,1,1)]

Page 47: Link Mining

Predictive Model for Classification

• A structured logistic regression
  – Compute P(c | OA(X)) and P(c | LD(X)) separately using separate logistic regression models

$$\hat{c}(X) = \arg\max_{c} \; P(c \mid OA(X)) \prod_{f \in \{In,\, Out,\, CI,\, CO\}} \frac{P(c \mid LD_f(X))}{P(c)}$$

  – where OA(X) are the object attributes and LD_f(X) are the link features
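The decision rule on this slide appears to pick the category c that maximises P(c | OA(X)) multiplied, over the link types f, by P(c | LD_f(X)) / P(c). Under that reading, a minimal sketch of the combination step looks as follows. The probability tables stand in for the outputs of the separate logistic regression models; all numbers and the function name `combine` are hypothetical.

```python
import math


def combine(p_oa, p_ld, prior):
    """Return argmax_c P(c|OA) * prod_f P(c|LD_f) / P(c), computed in log space."""
    scores = {}
    for c in prior:
        score = math.log(p_oa[c])
        for f_probs in p_ld.values():            # one table per link type f
            score += math.log(f_probs[c]) - math.log(prior[c])
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical model outputs for one object X and two categories.
prior = {"A": 0.5, "B": 0.5}                     # P(c)
p_oa = {"A": 0.6, "B": 0.4}                      # P(c | OA(X))
p_ld = {
    "In": {"A": 0.3, "B": 0.7},                  # P(c | LD_In(X))
    "Out": {"A": 0.4, "B": 0.6},                 # P(c | LD_Out(X))
}

prediction = combine(p_oa, p_ld, prior)          # content says A, links pull towards B
content_only = combine(p_oa, {}, prior)          # with no link models, content decides
```

Working in log space avoids underflow when many link types are multiplied together; it does not change which category wins.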

Page 48: Link Mining

Prediction

[Figure: the category set and linked papers P1–P5]

Step 1: Bootstrap using object attributes only

Page 49: Link Mining

Prediction

[Figure: linked papers P1–P5, with P4 highlighted]

Step 2: Iteratively update the category of each object, based on the linked objects’ categories
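The two prediction steps (bootstrap from object attributes, then iterate using the linked objects' categories) can be sketched as a simple loop. This is a simplified stand-in: the talk uses learned logistic regression models, whereas the update below is a plain majority vote over the neighbors plus the object's own content-based guess. The graph, the labels and the function name are hypothetical.

```python
from collections import Counter


def iterative_classification(content_guess, neighbors, max_iters=10):
    """Bootstrap labels from content, then repeatedly relabel from linked objects."""
    labels = dict(content_guess)                       # Step 1: object attributes only
    for _ in range(max_iters):                         # Step 2: iterate until stable
        changed = False
        for node in sorted(labels):
            linked = [labels[n] for n in neighbors.get(node, [])]
            if not linked:
                continue
            votes = Counter(linked)
            votes[content_guess[node]] += 1            # assumption: content counts as one vote
            best = max(sorted(votes), key=lambda c: votes[c])
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical example: P4 is misclassified from content alone.
content_guess = {"P1": "A", "P2": "A", "P3": "A", "P4": "B", "P5": "A"}
neighbors = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3", "P5"],
    "P5": ["P4"],
}
final_labels = iterative_classification(content_guess, neighbors)
```

The order in which objects are visited is fixed alphabetically here; a later slide in the talk compares different ordering strategies.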

Page 50: Link Mining

Data Sets

Data Set   papers  citations  categories  vocabulary
CoraI      3181    6185       7           1400
CoraII     3300    11794      10          3174
CiteSeer   3600    7522       6           3000

Page 51: Link Mining

Experiment I

[Figure: "Content and Link Effectiveness". Bar chart of average F1 measure (y-axis, 0–90) on the CoraI, CoraII and CiteSeer data sets (x-axis); series: Content Only, Links Only, Content + Links]

Page 52: Link Mining

Experiment II

[Figure: "Effectiveness of Different Link Types and Models". Bar chart of average F1 measure (y-axis, 0–90) for the Mode, Binary and Count models on CoraI, CoraII and CiteSeer (x-axis: Data Sets and Models); series: IN-links, OUT-links, CI-links, CO-links, ALL]

Page 53: Link Mining

Experiment III

• Setup
  – 20% data as test data
  – remaining data: 20%, 40%, 60%, 80% labeled data
• Link-based classification using labeled and unlabeled data
  – Labeled-only: learn model using only labeled data
  – Labeled and Unlabeled: learn model using both labeled and unlabeled data

Page 54: Link Mining

Learning with Labeled and Unlabeled Data

[Figure: accuracy (y-axis, 0.62–0.82) versus percentage of labeled data (x-axis: 20, 40, 60, 80); series: labeled & unlabeled, labeled only, content only]

Page 55: Link Mining

Ordering Strategies

[Figure: average accuracy (y-axis, 0.65–0.85) versus iteration (x-axis, 0–1100); series: RAND 1, DEC PP 1, DEC OUTLINKS 1, INC OUTLINKS 1, DEC PP 2, DEC OUTLINKS 2, INC OUTLINKS 2, RAND (HARD CLASSIF.), DEC PP (HARD CLASSIF.)]

Page 56: Link Mining

LBC: Summary

• Variety of ways of describing link neighborhoods
  – Mode, Binary, Count
  – In-links, Out-links, CI-links and CO-links
• In link-based classification, unlabeled data provide useful information:
  – They help us infer the object attribute distribution
  – Links between unlabeled data allow us to make use of attributes of linked objects
  – Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences
• Link-based Challenges addressed:
  – Feature construction
  – Collective classification
  – Use of labeled and unlabeled data

Page 57: Link Mining

Roadmap

• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • work with Indrajit Bhattacharya
    • Entity detection
    • Group Detection
• Conclusion

Page 58: Link Mining

Deduplication and Group Detection

• Object Consolidation
  – Observations come with noise or multiple representations
    • Multiple entries for the same person in a customer database
• Group Detection
  – Identify groups of similar entities
    • Group authors by research interest

Page 59: Link Mining

Terminology

Entities: Alfred V Aho

References: Alfred Aho; AV Aho; Aho, A. V.

Links:
  – Alfred Aho, John Hopcroft, Jeffrey Ullman
  – AV Aho, BW Kernighan, PJ Weinberger

Entity Groups: G1 (Programming Languages), G2 (Databases), G3 (Algorithms)

Page 60: Link Mining

Deduplication and Group Detection

• The two problems need to be addressed together – Goldberg and Senator, KDD 95

[Figure: DB → Consolidation & Link Formation → DB´ → KDD Tools → Knowledge]

Page 61: Link Mining

Related Work: Deduplication

• Statistics
  – Blocking, Newcombe
  – “Match/non-match”, Fellegi & Sunter
  – EM with match variable, Winkler
• AI, Machine Learning
  – String similarity measures, Monge & Elkan; Cohen
  – Object Consolidation, Tejada et al.
  – Learning string distances, Bilenko & Mooney
  – Active learning, Sarawagi et al.
  – Coreference resolution, McCallum & Wellner
  – Identity uncertainty, Pasula et al.
• Databases
  – Efficient record linkage, Hernandez & Stolfo; Monge & Elkan
  – Use of co-occurrence, Chaudhuri et al.; Ananthakrishna et al.

Page 62: Link Mining

Related Work: Group Detection

• Hypertext Mining
  – Eigen decomposition for ranking, Brin & Page; Kleinberg
  – Finding web communities, Gibson et al.
  – “Missing link”, Cohn & Hofmann
• Probabilistic Link Modeling
  – Generative model for links, Getoor et al.; Kubica et al.
• Text Retrieval
  – Spectral techniques, Ng, Jordan & Weiss; Dhillon et al.; Ding et al.
  – Probabilistic modeling with latent variables, Hofmann; Blei, Ng & Jordan; Rosen-Zvi et al.

Page 63: Link Mining

Paper Resolution Problem

• Example

– R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.

– Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

• Traditionally, string similarity

Page 64: Link Mining

Author Resolution Problem

• Given a set of papers, determine the set of authors
• First and middle names vary
  – Check common name transforms
• Problems remain
  – How about ‘J. Smith’ and ‘John Smith’?
  – Do two instances of ‘J. Smith’ refer to the same author?
• Use co-author relationships

Page 65: Link Mining

Author Deduplication: Example

Papers:
  – P1: Code generation for machines with multiregister operations
  – P2: The universality of database languages
  – P3: Optimal partial-match retrieval when fields are independently specified
  – P4: Code generation for expressions with common subexpressions

Author references: Alfred V Aho, Jeffrey D Ullman, S C Johnson, A V Aho, J D Ullman

References per paper:
  – Aho P1, Aho P2, Aho P3, Aho P4
  – Ullman P1, Ullman P2, Ullman P3, Ullman P4
  – Johnson P1, Johnson P4

Consolidated:
  – Aho: P1, P2, P3, P4
  – Ullman: P1, P2, P3, P4
  – Johnson: P1, P4

Page 66: Link Mining

Deduplication Using Links

• Cluster similar author references into duplicates
  – Problem: define appropriate distance measure
• Weighted combination of attribute and link distances
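A weighted combination of the two distances could take the form of a convex mix with a weight alpha, as sketched below. Both the convex form and the string measure are assumptions: the slide does not fix either, and `difflib.SequenceMatcher` from the Python standard library is used purely as a stand-in attribute distance.

```python
from difflib import SequenceMatcher


def attribute_distance(name_a, name_b):
    """Stand-in string distance in [0, 1]; 0 means the names are identical."""
    return 1.0 - SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def combined_distance(name_a, name_b, link_distance, alpha=0.5):
    """Assumed form: (1 - alpha) * attribute distance + alpha * link distance."""
    return (1.0 - alpha) * attribute_distance(name_a, name_b) + alpha * link_distance


# Identical names but completely different co-author links.
same_name_far_links = combined_distance("A V Aho", "A V Aho", link_distance=1.0, alpha=0.25)

# Different spellings but identical co-author links.
variant_name_same_links = combined_distance("Alfred V Aho", "A V Aho", link_distance=0.0)
```

With alpha = 0 the measure reduces to attribute-only matching, and with alpha = 1 only the links matter.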

Page 67: Link Mining

Link Distances for Deduplication

• To compare two author references, compare all their links/relations
• Distance between two links
  – How many duplicates do they share?
  – d(l1, l2) = 1 – |duplicates(l1, l2)| / max(|l1|, |l2|)
• Distance between Link Sets: Link detail distance
  – d(l, L) = min_{l’ in L} d(l, l’)
  – d_detail(L1, L2) = avg[ avg_{l in L1} d(l, L2), avg_{l in L2} d(l, L1) ]
• Distance between Link Sets: Link summary distance
  – Group detail distance is costly because of pair-wise comparison between group sets
  – Maintain group summary: all unique references in the group set
  – d_summ(L1, L2) = d(sum1, sum2)
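The distances on this slide translate almost directly into code. In the sketch below each link is a set of cluster ids, so two links share a duplicate when they contain the same id; that representation is an assumption. The detail distance is read as comparing each link with the other link set, which is also an assumption about the intended formula. The example data are hypothetical.

```python
def link_distance(l1, l2):
    """d(l1, l2) = 1 - |duplicates(l1, l2)| / max(|l1|, |l2|)."""
    return 1.0 - len(l1 & l2) / max(len(l1), len(l2))


def link_to_set_distance(l, link_set):
    """d(l, L): distance from link l to its closest link in L."""
    return min(link_distance(l, other) for other in link_set)


def detail_distance(L1, L2):
    """Average of the two directed average link-to-set distances."""
    avg12 = sum(link_to_set_distance(l, L2) for l in L1) / len(L1)
    avg21 = sum(link_to_set_distance(l, L1) for l in L2) / len(L2)
    return (avg12 + avg21) / 2.0


def summary_distance(L1, L2):
    """d(sum1, sum2), where a summary is the set of all references in a link set."""
    sum1 = frozenset().union(*L1)
    sum2 = frozenset().union(*L2)
    return link_distance(sum1, sum2)


# Hypothetical link sets of two author references (ids of current clusters).
L1 = [frozenset({"aho", "ullman", "johnson"}), frozenset({"aho", "ullman"})]
L2 = [frozenset({"aho", "ullman"}), frozenset({"aho", "kernighan"})]

d_detail = detail_distance(L1, L2)
d_summ = summary_distance(L1, L2)
```

The detail distance needs a pairwise comparison of all links in the two sets, while the summary distance needs a single set comparison, which is the cost argument made on the slide.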

Page 68: Link Mining

Group Detection: Example

Entities: A. Aho, J. Hopcroft, J. Ullman, R. Sethi

Links:
  – Alfred Aho, John Hopcroft, Jeffrey Ullman, Data Structures and Algorithms
  – AV Aho, R Sethi, J D Ullman, Compilers: Principles, Techniques and Tools

Groups: PL, Databases, Algorithms

• Problem: Discover the hidden set of groups and mapping from entities to groups

Page 69: Link Mining

Group Detection Using Links

• Cluster similar links into groups

Algorithms:
  – Alfred Aho, John Hopcroft, Jeffrey Ullman, Design and Analysis of Computer Algorithms
  – Alfred Aho, John Hopcroft, Jeffrey Ullman, Data Structures and Algorithms

PL & Compilers:
  – AV Aho, R Sethi, J D Ullman, Compilers: Principles, Techniques and Tools
  – Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, The AWK Programming Language
  – Alfred V. Aho, Jeffrey D. Ullman, Principles of Compiler Design

Page 70: Link Mining

Link Distances for Group Detection

• Define cluster distance considering generation probability of observed links given model M
• Probabilistic Distance d_LP of two clusters
  – What is the change in probability of observed data if the two clusters are merged?
• Group summary distance d_summ
  – d_LP is lower for higher overlap in entity sets of two clusters
  – But entity sets are the link summaries
  – Use summary distance as approximation of d_LP

Page 71: Link Mining

Deduplication / Group Detection Using Clustering

• Each cluster contains currently known duplicates / links from the same group
• Initialize clusters
  – Deduplication: using attribute distance only
• Greedily pick best candidate using d(ci, cj)
  – Compare different distance measures
• Update link sets / summaries and cluster distances and continue
• Select candidates for merge using threshold
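The clustering loop outlined on this slide can be sketched as greedy agglomerative merging with a distance threshold. The sketch makes several simplifying assumptions: clusters start as singletons rather than from an attribute-distance pass, the cluster distance is passed in as a function and recomputed each round, and the toy distance over numbers only demonstrates the control flow.

```python
def greedy_cluster(items, distance, threshold):
    """Repeatedly merge the closest pair of clusters until none is within threshold."""
    clusters = [[item] for item in items]            # assumption: start from singletons
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)                 # remember the best merge candidate
        if best[0] > threshold:
            break                                    # no candidate is close enough
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]      # merge; distances are recomputed next round
        del clusters[j]
    return clusters


def toy_distance(c1, c2):
    """Hypothetical cluster distance: smallest gap between members (single linkage)."""
    return min(abs(a - b) for a in c1 for b in c2)


result = greedy_cluster([1.0, 1.2, 5.0, 5.1, 9.0], toy_distance, threshold=0.5)
```

In the setting of the talk, `distance` would be one of the attribute, link detail or link summary distances, and the merge step would also update the link sets or summaries of the merged cluster.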

Page 72: Link Mining

Evaluation: Data Generator Parameters

• Structural Parameters
  – Number of author-entities and groups
  – Degree of overlap among the groups
• Noise parameters
  – For deduplication data, generate noisy attributes for each entity in the links
• Generative Parameters
  – Number of links
  – Mean size of links

Page 73: Link Mining

Evaluation: Metric

• Diversity of clusters
  – How many different entities does a cluster contain?
  – Links from how many different groups does a cluster contain?
• Dispersion of entities/groups
  – How many different clusters is an entity spread over?
  – How many different clusters are links from the same group spread over?
• Measure average dispersion and diversity
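For the deduplication case, both metrics can be computed from a clustering and the true entity of each reference. The sketch below is a direct reading of the questions on the slide; the reference ids, entity names and function names are hypothetical.

```python
def diversity(clusters, truth):
    """Average number of distinct true entities per cluster (1.0 is ideal)."""
    return sum(len({truth[r] for r in cluster}) for cluster in clusters) / len(clusters)


def dispersion(clusters, truth):
    """Average number of clusters each true entity is spread over (1.0 is ideal)."""
    spread = {}
    for index, cluster in enumerate(clusters):
        for r in cluster:
            spread.setdefault(truth[r], set()).add(index)
    return sum(len(s) for s in spread.values()) / len(spread)


# Hypothetical ground truth: reference id -> true author entity.
truth = {"r1": "aho", "r2": "aho", "r3": "ullman", "r4": "ullman", "r5": "johnson"}

noisy = [["r1", "r3"], ["r2"], ["r4", "r5"]]       # mixes entities and splits them
perfect = [["r1", "r2"], ["r3", "r4"], ["r5"]]     # one cluster per entity

noisy_scores = (diversity(noisy, truth), dispersion(noisy, truth))
perfect_scores = (diversity(perfect, truth), dispersion(perfect, truth))
```

Over-merging raises diversity and over-splitting raises dispersion, so the two are reported together.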

Page 74: Link Mining

Results: Deduplication

• Algorithm Parameters: mixing weight and threshold
  – Superior results with link distances

[Figure: "Dispersion vs diversity for varying alpha". Cluster diversity (y-axis, 0–1000) versus entity dispersion (x-axis, 1–2.4); series: attribute, summary]

[Figure: "Dispersion vs diversity for varying threshold". Cluster diversity (y-axis, 0–1000) versus entity dispersion (x-axis, 1–2.8); series: attribute, summary]

• Detailed results in DMKD ’04 paper, Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor.

Page 75: Link Mining

Results: Distance Measures

• Comparison of attribute distance, group summary distance and group detail distance

[Figure: "Cluster Diversity". Cluster diversity (y-axis, 1–3.5) versus number of clusters (x-axis, 800–2200); series: attribute, group_summary, group_detail]

[Figure: "Entity Dispersion". Entity dispersion (y-axis, 1.3–2.3) versus number of clusters (x-axis, 800–2200); series: attribute, group_summary, group_detail]

Page 76: Link Mining

Deduplicating Real Data

• Machine learning papers from Citeseer
  – Citations hand-matched by Steve Lawrence et al.
  – Author references hand-matched by Culotta & McCallum
  – 1504 paper citations
  – 2892 author references
  – 1167 author entities identified
• Initial results
  – Link summary improves entity dispersion over attribute clustering
  – Discovered labeling errors that are hard to identify considering attributes only

Page 77: Link Mining

Deduplicating Real Data

• 174 | 610 | barron_a_r | A.R. Barron
• 175 | 610 | barron_r_l | R.L. Barron
• Not the same entity
  – Barron, A.R., Barron, R.L., 1988. Statistical learning networks: a unifying view. In: 1988 Symposium on the Interface: Statistics and Computer Science, pp. 192-203.

Page 78: Link Mining

Deduplicating Real Data

• 2097 | 8460 | ramakrishnan_c_r | C. R. Ramakrishnan
• 2098 | 8460 | ramakrishnan_i_v | I. V. Ramakrishnan
• Not the same entity
  – A Symbolic Constraint Solving Framework for Analysis of Logic Programs, C.R. Ramakrishnan, I.V. Ramakrishnan and R. Sekar, ACM Conference on Partial Evaluation and Semantics based Program Manipulation (PEPM), June 1995

Page 79: Link Mining

Deduplicating Real Data

• Parse Error
  – 1734 | 7010 | minton_andrew_b_philips_steven | Andrew B. Philips Steven Minton
• Same entity as
  – 1735 | 7020 | minton_s | Minton , S.

Page 80: Link Mining

Deduplicating Real Data

• Parse Error
  – 2083 | 8370 | raedt_2_l_de | 2. L. De Raedt
• Same entity as
  – 2085 | 8380 | raedt_l_de | L. De Raedt

Page 81: Link Mining

Deduplication and Group Detection Summary

• Study of novel distance measures for clustering similar entities in linked environments
• Unified generative model for evaluating the related problems
• Link-based clustering shows superior performance over attribute clustering for both tasks on synthetic data
• Link-based Challenges addressed:
  – Collective consolidation

Page 82: Link Mining

Roadmap

• Intro to Link Mining
  – Link Mining Tasks
  – Link Mining Challenges
• Some Current Projects
  – Link-based Classification
    • work with Qing Lu and Prithviraj Sen
    • Link-based classification using a variety of link descriptions
    • Link-based classification using labeled and unlabeled data
  – Link-based Clustering
    • Entity detection
    • Group Detection
• Conclusion

Page 83: Link Mining

Link Mining Summary

• Link Mining Tasks
  – Link-based Object Classification
  – Object Type Prediction
  – Link Type Prediction
  – Predicting Link Existence
  – Link Cardinality Estimation
  – Object Consolidation
  – Group Detection
  – Subgraph Discovery
  – Metadata Mining
• Link Mining Challenges
  – Logical vs. Statistical dependencies
  – Feature construction
  – Instances vs. Classes
  – Collective Classification
  – Collective Consolidation
  – Effective Use of Labeled & Unlabeled Data
  – Link Prediction
  – Closed vs. Open World

Page 84: Link Mining

References

• Deduplication and Group Detection Using Links, Indrajit Bhattacharya and Lise Getoor. 10th ACM SIGKDD Workshop on Link Analysis and Group Detection, Seattle, WA, August 2004.

• Word Sense Disambiguation using Probabilistic Models, Indrajit Bhattacharya, Lise Getoor and Yoshua Bengio. 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, SP, July 2004.

• Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor. 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, FR, June 2004.

• Using the Structure of Web Sites for Automatic Segmentation of Tables, Kristina Lerman, Lise Getoor, Steve Minton and Craig Knoblock. Proceedings of ACM-SIGMOD 2004 International Conference on Management of Data, Paris, FR, June 2004.

• Structure Discovery using Statistical Relational Learning, Lise Getoor. Data Engineering Bulletin, vol. 26, No. 3, 2003.

• Link Mining: A New Data Mining Challenge, Lise Getoor. SIGKDD Explorations, volume 5, issue 1, 2003.

• Iterative Deduplication, I. Bhattacharya, L. Getoor.

• Link Mining: A New Data Mining Challenge, L. Getoor. SIGKDD Explorations, volume 4, issue 2, 2003.

• Link-based Classification, Q. Lu and L. Getoor, International Conference on Machine Learning, August, 2003

• Labeled and Unlabeled Data for Link-based Classification, Q. Lu and L. Getoor. ICML workshop on The Continuum from Labeled to Unlabeled Data, August, 2003.

• Link-based Classification for Text Classification and Mining, Q. Lu and L. Getoor. IJCAI workshop on Text Mining and Link Analysis

• IJCAI 03 Workshop: Learning Statistical Models from Relational Data SRL 2003, http://kdl.cs.umass.edu/srl2003

• ICML 04 Workshop: Statistical Relational Learning and Connections to Other Fields, SRL 2004, http://www.cs.umd.edu/srl2004
