Challenge Problem: Link Mining
Lise Getoor, University of Maryland, College Park

Page 1: Challenge Problem:  Link Mining

Challenge Problem: Link Mining

Lise Getoor
University of Maryland, College Park

Page 2: Challenge Problem:  Link Mining

Link Mining

• Data
– Structured Input: Mining graphs and networks
– Structured Output: Extracting entities and relationships from unstructured data
• Making use of Links
– For ranking nodes
– For collective classification of nodes
• Discovering Links
– Predicting missing links
– Discovering new kinds of links and relationships

Page 3: Challenge Problem:  Link Mining

Link Mining Tasks

• Node Centric
– Labeling/ranking nodes (aka Collective Classification/PageRank)
– Consolidating nodes (aka Entity Resolution)
– Discovering hidden nodes (aka Group Discovery)
• Edge Centric
– Labeling/ranking edges
– Predicting the existence of edges
– Predicting the number of edges
– Discovering new relations/paths
• Graph/Subgraph Centric
– Discovering frequent subpatterns
– Generative models
– Metadata discovery, extraction, and reformulation

Reference: SIGKDD Explorations Special Issue on Link Mining, December 2005.

Page 4: Challenge Problem:  Link Mining

The Link Mining Challenge

• Current research mostly focuses on a single task, e.g., node ranking or link prediction
• In real data analysis scenarios, we need a mix of all of these capabilities
• Many potential domains:
– Bioinformatics
– Social network analysis
– Citation analysis
– Fraud detection
– …

Page 5: Challenge Problem:  Link Mining

Challenge Problem Requirements

1. Relevant to data mining and based on analysis of large volumes of data (including web, text, images, links, etc.), preferably publicly available data.

2. Important and difficult, so that its solution will advance the field and benefit society.

3. Interesting and exciting, to attract the attention of researchers, the public, and the press, as well as funding. This requires a simple and concise problem statement.

4. The required domain knowledge should be relatively accessible.

5. Other groups are not already actively working on this problem.

Page 6: Challenge Problem:  Link Mining

Domain

Collaboratively edited, user-contributed encyclopedia. Largest example of participatory journalism to date.

Mantra: maintain a neutral point of view (NPOV)

Evangelists: “Goal to distribute free encyclopedia to every single person on the planet in their own language”
Jimmy Wales, Wikipedia founder

Detractors: “Wikipedia has gone from a nearly perfect anarchy to an anarchy with gang rule.”
Larry Sanger, Wikipedia co-founder

“Disaster is not too strong a word for Wikipedia… the site is infested with moonbats”
Eric Raymond, open-source movement figure

Know It All: Can Wikipedia Conquer Expertise? Stacy Schiff, New Yorker, July 31, 2006

Page 8: Challenge Problem:  Link Mining

Task #2: User Classification

• Wiki Gnome: a user who keeps a low profile, fixing typos, poor grammar, and broken links
• Wiki Troll: a disruptive user who persistently violates the site’s guidelines

Gnome vs. Troll

Page 9: Challenge Problem:  Link Mining

Task #3: Text Classification

Three Wikipedia Content Guidelines:

1. NPOV: represent views fairly and without bias

2. Verifiability

3. No original research

Page 10: Challenge Problem:  Link Mining

#4: Link Prediction/Completion

• Identify where links should exist
• As Wikipedia grows, it becomes harder for any given author to know about other relevant articles they can or should link to from a given article.
• A method that helps with this (link suggestion, auto-linking, etc.) would potentially be very useful.
• Evaluation: Generate a dataset by taking a given set of Wikipedia pages and removing some of the existing links, then see whether a system can identify those places and suggest appropriate links.
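The evaluation protocol on this slide can be sketched in a few lines. The following is a minimal illustration, not a method from the talk. It assumes that links are given as (page, page) pairs and treated as undirected, and it uses a common-neighbours score as a stand-in for the link-suggestion system under test. The function names `hold_out_links`, `suggest_links`, and `precision_recall` are hypothetical.

```python
import random
from collections import defaultdict


def hold_out_links(links, fraction, seed=0):
    """Split links into (kept, removed); `removed` plays the role of the hidden links."""
    rng = random.Random(seed)
    shuffled = sorted(links)          # sort first so the split is reproducible
    rng.shuffle(shuffled)
    n_removed = int(len(shuffled) * fraction)
    return set(shuffled[n_removed:]), set(shuffled[:n_removed])


def suggest_links(kept, k):
    """Assumed baseline: rank unlinked page pairs by their number of shared neighbours."""
    neighbours = defaultdict(set)
    for a, b in kept:                 # links are treated as undirected (an assumption)
        neighbours[a].add(b)
        neighbours[b].add(a)
    pages = sorted(neighbours)
    scored = []
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            if b in neighbours[a]:
                continue              # already linked, nothing to suggest
            shared = len(neighbours[a] & neighbours[b])
            if shared:
                scored.append((shared, a, b))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(a, b) for _, a, b in scored[:k]]


def precision_recall(suggested, removed):
    """Compare suggested pairs against the removed links, ignoring direction."""
    hidden = {frozenset(link) for link in removed}
    hits = sum(1 for pair in suggested if frozenset(pair) in hidden)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(hidden) if hidden else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy link set, made up for illustration only.
    toy_links = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")}
    kept, removed = hold_out_links(toy_links, fraction=0.2)
    suggestions = suggest_links(kept, k=len(removed))
    print(precision_recall(suggestions, removed))
```

A real evaluation would need to keep link direction and the position of each link in the article text, since the slide asks the system to identify the places where links belong, not only which pages should be connected.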

Page 11: Challenge Problem:  Link Mining

Other Link Mining Tasks

• Trust/Reputation analysis
– “Gives no privilege to those who know what they are talking about”, William Connolley, climate modeler and Wikipedia admin
• Social network analysis
– Identification of communities
• Accuracy
– Nature comparison with Britannica (4-3 error ratio)
• Misuse
– Vandalism and self-promotion
• Coverage
– Which areas aren’t covered, or are poorly covered/linked?

Page 12: Challenge Problem:  Link Mining

But none of these are grand challenges…

• According to Wikipedia

Page 13: Challenge Problem:  Link Mining

The Wikipedia Grand Challenge

The Wikipedia Test:

Given a collection of entries constructed via participatory journalism (PJ) vs. link mining (LM),

Can you distinguish between PJ and LM? Which is better?

Evaluation:
– Via a panel of human experts
– Via PageRank

Solution will require a variety of integrated link mining capabilities

Page 14: Challenge Problem:  Link Mining

$$ Already Available…

• Hutter prize: http://prize.hutter1.net/
• 50,000 € ≈ $64,000

http://en.wikipedia.org/wiki/The_64%2C000_Dollar_Question

Page 15: Challenge Problem:  Link Mining