Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ Web...


Page 1: Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining

Mining Topic-Specific Concepts and Definitions on the Web

Bing Liu, etc

KDD03

CS591CXZ Web mining: Lexical relationship mining

Page 2: Lexical relationship mining

• A lexical relationship is a relationship between words, such as synonymy, antonymy, hypernymy (“dog” is a hypernym of “poodle”), and hyponymy (“poodle” is a hyponym of “dog”).

• A lexical relationship is a connection between the meanings of two words in a text which helps the text to hold together. Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections in a field of meaning (e.g. plane - pilot).

Thus, subtopic mining is in this category, but definition mining is not.

Page 3: Information Extraction

• MUC: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/

Information Extraction: the extraction or pulling out of pertinent information from large volumes of texts

Items of Information    Percentile Reliability
Entities                90
Attributes              80  (definition falls here)
Facts                   70
Events                  60

Attribute: a property of an entity such as its name, alias, descriptor, or type

Page 4: Mining Topic-Specific Concepts and Definitions on the Web

• Goal: Systematically learn an unfamiliar topic from the Web
  – Definitions
  – Topic hierarchy
• Input: a term, e.g. “data mining”, “Web mining”
• Tasks
  – Identify sub-topics or salient concepts
    • Like building an ontology, but no clear hierarchy
      E.g.: Genetic Algorithm
    • Algorithms
  – Find and organize definition pages
    • Definition question answering
  – Concept disambiguation

Page 5: Techniques

• A lot of heuristics
  – Simple linguistic patterns
    {concept} {-|:} {definition}
    {concept} {refer(s) to | satisfy(ies)} ……
  – Web page tags
    <h1>,…,<h4> <b> <em> <li> …
• Frequent pattern mining
  – A classic data mining technique

Page 6: Algorithm

WebLearn(T)
• Submit T to a search engine and get relevant pages
• Mine subtopics or salient concepts of T
• Find definition pages
• Output the concepts and definition pages to users

If a user wants to know more about a subtopic T’, do WebLearn(T’).
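The WebLearn(T) steps above can be sketched as a small driver function. This is only an illustration of the control flow: the names `web_learn`, `search`, `mine_concepts`, and `find_definitions` are hypothetical, and the three callables are placeholders for the search engine and the mining steps described on the following pages.

```python
from typing import Callable, Dict, List


def web_learn(
    topic: str,
    search: Callable[[str], List[str]],
    mine_concepts: Callable[[str, List[str]], List[str]],
    find_definitions: Callable[[str, List[str]], List[str]],
) -> Dict[str, object]:
    """One round of WebLearn(T), following the four steps on the slide."""
    pages = search(topic)                          # submit T to a search engine
    concepts = mine_concepts(topic, pages)         # subtopics / salient concepts
    definitions = find_definitions(topic, pages)   # definition pages
    return {"topic": topic, "concepts": concepts, "definition_pages": definitions}


# Stand-in callables (assumptions) so the sketch runs without network access.
def fake_search(topic: str) -> List[str]:
    return [f"page about {topic}"]


def fake_concepts(topic: str, pages: List[str]) -> List[str]:
    return [f"{topic} subtopic"]


def fake_definitions(topic: str, pages: List[str]) -> List[str]:
    return pages[:1]


result = web_learn("data mining", fake_search, fake_concepts, fake_definitions)
# The user picks a subtopic T' and the same procedure is run on it.
deeper = web_learn(result["concepts"][0], fake_search, fake_concepts, fake_definitions)
```

Passing the steps in as callables keeps the recursion on T’ explicit: exploring a subtopic is just another call with a different term.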

Page 7: Mining subtopic/salient concept (1)

Input: a set of top-ranked relevant documents

Steps:

1. Filter out “noisy” documents
• Publication listing pages: “in proceeding”, “journal”
• Forum discussion pages: “previous message”, “reply to”
• Pages that do not contain all query terms
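A minimal sketch of this filtering step, assuming each page is available as plain text. The cue phrases come from the slide; treating a single occurrence of a cue as enough to discard a page is an assumption, and the names `NOISE_CUES`, `is_noisy`, and `filter_pages` are hypothetical.

```python
from typing import List

# Cue phrases listed on the slide for publication listings and forum pages.
NOISE_CUES = ("in proceeding", "journal", "previous message", "reply to")


def is_noisy(page_text: str, query: str) -> bool:
    """Return True if the page matches a noise cue or misses a query term."""
    text = page_text.lower()
    # Assumption: one occurrence of a cue phrase marks the page as noisy.
    if any(cue in text for cue in NOISE_CUES):
        return True
    # Pages that do not contain all query terms are also dropped.
    return not all(term in text for term in query.lower().split())


def filter_pages(pages: List[str], query: str) -> List[str]:
    """Keep only the pages that pass the noise filter."""
    return [page for page in pages if not is_noisy(page, query)]
```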

Page 8: Mining subtopic/salient concept (2)

2. Identify important phrases in each page
• Extract text segments in HTML emphasizing tags: <h1>,…,<h4> <b> <em> <li> …
• Except those containing:
  – Salutation title (Mr., Dr., Professor)
  – URL or email address
  – “conference”, “journal” …
  – Digits (e.g. KDD2004)
  – Images
  – Too many words (15 words as limit)
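The extraction and exclusion rules above might look like the following sketch, built on Python's standard `html.parser`. The class and function names are hypothetical, the regular expressions are rough approximations of the listed exclusions, and the sketch assumes emphasizing tags are explicitly closed (old pages often leave `<li>` open).

```python
import re
from html.parser import HTMLParser
from typing import List

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "em", "li"}
SALUTATION = re.compile(r"\b(mr|dr|professor)\b", re.I)
URL_OR_EMAIL = re.compile(r"https?://|www\.|\S+@\S+", re.I)
VENUE_WORDS = re.compile(r"\b(conference|journal)\b", re.I)
MAX_WORDS = 15  # word limit from the slide


class EmphasisExtractor(HTMLParser):
    """Collect the text of each outermost emphasizing-tag segment."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0            # nesting depth inside emphasizing tags
        self.buffer: List[str] = []
        self.has_image = False    # an <img> was seen inside the segment
        self.segments: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1
        elif tag == "img" and self.depth > 0:
            self.has_image = True

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text and not self.has_image:
                    self.segments.append(text)
                self.buffer, self.has_image = [], False

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def keep_segment(text: str) -> bool:
    """Apply the slide's exclusion rules to one emphasized segment."""
    if len(text.split()) > MAX_WORDS:
        return False
    if any(ch.isdigit() for ch in text):
        return False
    return not (SALUTATION.search(text) or URL_OR_EMAIL.search(text)
                or VENUE_WORDS.search(text))


def important_phrases(html: str) -> List[str]:
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return [seg for seg in parser.segments if keep_segment(seg)]
```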

Page 9: Mining subtopic/salient concept (3)

3. Mine frequent phrases
• Input: emphasized text segments
• Mine frequent word sets using an association rule mining technique

4. Eliminate word sets unlikely to be subtopics
• Heuristic: eliminate those that do not appear alone in emphasizing tags in any page, e.g. “process”
• Remove generic words from the result set: “abstract”, “introduction”, “conclusion”, “research”, …

5. Rank the result sets
• According to the number of pages in which they occur
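Steps 3 to 5 can be illustrated with a small sketch. It replaces a real association rule miner with brute-force enumeration of small word sets, which is only practical for short segments. The function name, the `min_pages` support threshold, the `max_size` cap, and the choice to drop any word set containing a generic word are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

GENERIC_WORDS = {"abstract", "introduction", "conclusion", "research"}


def rank_subtopics(
    pages: List[List[str]], min_pages: int = 2, max_size: int = 3
) -> List[Tuple[str, int]]:
    """pages: one list of emphasized text segments per page."""
    page_counts: Counter = Counter()
    standalone: Dict[FrozenSet[str], str] = {}  # word sets that fill a whole segment
    for segments in pages:
        seen = set()  # word sets found on this page (each counted once per page)
        for seg in segments:
            words = frozenset(seg.lower().split())
            standalone.setdefault(words, seg.lower())
            # Step 3: enumerate candidate word sets (brute force, not Apriori).
            for size in range(1, min(max_size, len(words)) + 1):
                for combo in combinations(sorted(words), size):
                    seen.add(frozenset(combo))
        page_counts.update(seen)
    # Step 4: keep frequent sets that appear alone in a tag and are not generic.
    kept = [
        (standalone[ws], n)
        for ws, n in page_counts.items()
        if n >= min_pages and ws in standalone and not (ws & GENERIC_WORDS)
    ]
    # Step 5: rank by the number of pages in which the set occurs.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

In a toy input where “process” only ever appears inside longer segments such as “Mining process”, it is frequent but never stands alone, so step 4 removes it, matching the slide's example.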

Page 10: Definition Finding

• Definition identification patterns suitable for Web pages
  {concept} {-|:} {definition}
  {concept} {refer(s) to | satisfy(ies)} …
• HTML structuring clues and hyperlinks
  – If a page has only one header (<h1>, <h2>, …) or one big emphasized segment at the beginning => definition page
  – Look up definition pages up to the second level of the hyperlinks, following only hyperlinks whose anchor text matches the concept
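The two identification patterns shown above can be approximated with regular expressions. This is a rough sketch covering only those two patterns (the slide's “…” indicates there are more); the function names are hypothetical, and requiring whitespace after the dash or colon is an assumption made to avoid matching hyphenated words such as “data mining-based”.

```python
import re
from typing import List, Pattern


def definition_patterns(concept: str) -> List[Pattern[str]]:
    """Build regexes for the two patterns listed on the slide."""
    c = re.escape(concept)
    return [
        # {concept} {-|:} {definition}
        re.compile(rf"\b{c}\s*[-:]\s+\S", re.I),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(rf"\b{c}\s+(?:refers?\s+to|satisf(?:y|ies))\b", re.I),
    ]


def looks_like_definition(sentence: str, concept: str) -> bool:
    """Return True if the sentence matches any definition pattern."""
    return any(p.search(sentence) for p in definition_patterns(concept))
```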

Page 11: Subtopic disambiguation

• By adding context terms
  – usually parent topic or subtopics
    • context terms tend to dominate results
    • cannot work for the first (root) topic
• Heuristics to combat domination of context terms
  – only consider text segments containing the topic or subtopic
  – identify pages with topic hierarchy
    HTML list tag <li>. The hierarchy should also contain other subtopics of the parent topic
  – shallow linguistic phenomena
    Topic + “approaches” / “techniques” + ( + “e.g” / “such as” / “including” + subtopics )

Then, how does this help disambiguate?
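The shallow linguistic pattern above could be matched with a regular expression such as the one below. This sketch assumes the parentheses in the pattern are literal characters in the page text and that subtopics are separated by commas or “and”; both are guesses about how the pattern is meant to be read, and `subtopics_from_pattern` is a hypothetical name.

```python
import re
from typing import List


def subtopics_from_pattern(text: str, topic: str) -> List[str]:
    """Extract subtopics from 'Topic approaches/techniques (e.g. a, b and c)'."""
    pattern = re.compile(
        rf"{re.escape(topic)}\s+(?:approaches|techniques)\s*\(\s*"
        r"(?:e\.g\.?,?|such as|including)\s+([^)]*)\)",
        re.I,
    )
    found: List[str] = []
    for match in pattern.finditer(text):
        # Assumption: list items are separated by commas or the word "and".
        for part in re.split(r",|\band\b", match.group(1)):
            if part.strip():
                found.append(part.strip())
    return found
```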

Page 12: Evaluation

• Use Google to get the initial set of relevant pages
• Result 1: subtopics / salient concepts
  – Looks pretty good; terms are closely relevant
  – More salient concepts than subtopics
• Result 2: definition discovery comparison
  – Precision: WebLearn vs Google vs AskJeeves
• Result 3: disambiguation
  – Seems to be useful

Page 13: Analysis

• Interesting topic
  – Potentially usable in practice
• A complete system
• Techniques
  – Avoid NLP, Machine Learning
  – Apply heuristics of shallow text structures

Page 14: Limitations

• Research topics, not much ambiguity
• Techniques:
  – Heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains

Page 15: How to improve? -- discussion

• Better research:
  – do you think it is a good research topic?
• Better techniques:
  – what techniques would you like to try to solve the problem?

Page 16: Thank you!