
Wordnet-Enhanced Topic Models

Hsin-Min Lu

Department of Information Management

National Taiwan University

Outline

Introduction

Literature Review

Wordnet-Enhanced Topic Model

Experiments


Introduction

Leveraging unstructured data is a challenging yet rewarding task.

Topic modeling, a family of unsupervised learning models, is useful for discovering latent topic structures in free text.

Topic models assume that a document is a mixture of topic distributions.

Each topic is a distribution over the vocabulary.
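Formally, the word distribution of a document d is a mixture over K topics:

    P(w \mid d) = \sum_{k=1}^{K} P(z = k \mid d) \, P(w \mid z = k)

where P(z = k | d) is document d's topic proportion and P(w | z = k) is topic k's distribution over the vocabulary.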

Statistical Topic Models for Text Mining

[Figure: text collections feed into probabilistic topic modeling, which outputs topic models (multinomial distributions over words), e.g., {web 0.21, search 0.10, link 0.08, graph 0.05} and {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03}. Applications include subtopic discovery, opinion comparison, summarization, and topical pattern analysis.]

Representative models: PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], Topics over time [Wang et al. 06].

Introduction (Contd.)

An on-going research stream incorporates meta-data variables into topic modeling

Richer models

Useful estimation results

This study aims at incorporating Wordnet synset information into topic models

A topic may be a combination of Wordnet synsets, and/or

The hidden co-occurrence structure

Introduction (Contd.)

Wordnet-Enhanced Topic Model

Incorporate Wordnet synsets into topic models

Wordnet synsets affect the prior of topics

Multinomial-probit-like setting for the prior

Wordnet synsets influence topic inference at the token level

Document-level random effects capture document-wide topic tendency

Inference using Gibbs sampling
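As a purely illustrative sketch (the exact specification belongs to this study; the form below is only an assumption consistent with the bullets above, not the model itself), a multinomial-probit-like prior with document-level random effects could take the form

    y_{dnk} = x_{dn}^{\top} \lambda_k + u_{dk} + \epsilon_{dnk}, \qquad \epsilon_{dnk} \sim N(0, 1)

where x_{dn} encodes the Wordnet synsets of token n in document d, \lambda_k weights those synsets for topic k, u_{dk} is a document-level random effect, and the prior tendency toward topic k grows with y_{dnk}.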

Literature Review

Wordnet

Latent Dirichlet Allocation (LDA)

LDA with Dirichlet Forest Prior

Concept-Topic Model

LDA with Wordnet


Wordnet

WordNet is a large lexical database of English.

POS: nouns, verbs, adjectives, and adverbs

Words are organized into synsets

A synset expresses a distinct concept

Synsets are interlinked by conceptual-semantic and lexical relations

Synsets form a network

Useful for computational linguistics and natural language processing
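A quick way to explore synsets is NLTK's WordNet interface. A minimal sketch (assumes the nltk package and its wordnet corpus are installed):

    from nltk.corpus import wordnet as wn

    # All noun synsets containing the word form "fake"
    for syn in wn.synsets('fake', pos=wn.NOUN):
        # Each synset carries a gloss and a set of member lemmas (synonyms)
        print(syn.name(), '-', syn.definition())
        print('  lemmas:', [lemma.name() for lemma in syn.lemmas()])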

Wordnet (Contd.)

Important differences between WordNet and a thesaurus:

WordNet interlinks not just word forms (strings of letters) but specific senses of words

WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity

WordNet (Contd.)

A lexical semantic network relating word forms and lexicalized concepts (i.e., concepts that speakers have adopted word forms to express)

Main relations: hyponymy/troponymy (kind-of/way-to), meronymy (part-whole), synonymy, antonymy

Predominantly hierarchical; few relations across grammatical classes; glosses and example sentences do not participate in the network

Nouns are organized under 9 unique beginners

Command-line interface and C library

Prehistoric (but greppable!) db format
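These relations are exposed as methods in NLTK. A small sketch (the example words are chosen only for illustration):

    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')
    print(dog.hypernyms())       # kind-of relation, seen upward
    print(dog.hyponyms()[:3])    # more specific kinds of dog
    print(dog.part_meronyms())   # part-whole relation
    # Antonymy is a lemma-level (word form) relation, not a synset relation
    print(wn.synset('good.a.01').lemmas()[0].antonyms())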

Lexical Matrix

Creation of Synsets

Three principles: minimality, coverage, replaceability

Synsets

{house} is ambiguous. {house, home} has the sense of a social unit living together; is this the minimal unit?

{family, house, home} makes the unit completely unambiguous.

For coverage: {family, household, house, home}, ordered according to frequency.

Replaceability of the most frequent words is a requirement.

Synset creation

From first principles:

Pick all the senses from good standard dictionaries.

Obtain synonyms for each sense.

Needs hard and long hours of work.

Wordnet Statistics (Version 2.1)

POS        Unique Strings   Synsets   Total Word-Sense Pairs
Noun           117097         81426          145104
Verb            11488         13650           24890
Adjective       22141         18877           31302
Adverb           4601          3644            5720
Totals         155327        117597          207016
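These counts can be reproduced with NLTK (note: NLTK ships WordNet 3.0, so the numbers will differ somewhat from the 2.1 table above):

    from nltk.corpus import wordnet as wn

    # n = noun, v = verb, a = adjective, r = adverb
    # (satellite adjectives carry a separate POS tag, 's')
    for pos in ('n', 'v', 'a', 'r'):
        print(pos, sum(1 for _ in wn.all_synsets(pos)))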

Wordnet Example

Fake (n) has three senses:

Something that is counterfeit; not what it seems to be (synonyms: sham, postiche)

A person who makes deceitful pretenses (synonyms: imposter, impostor, pretender, faker, ...)

[Football] A deceptive move made by a football player (synonym: juke)

[Figure: WordNet hypernym hierarchy for the noun senses of fake. The person sense (imposter, impostor, pretender, faker, fraud, shammer, role player, pseudo, pseud) traces up through deceiver, wrongdoer, bad person, person, organism/being, living thing, and causal agent to entity. The counterfeit sense (sham, postiche) traces up through imitation, copy, representation, creation, artifact, whole thing/unit, and physical object to entity. The football sense (juke) traces up through feint, tactical maneuver, move, decision, choice/selection, and action to act/human action.]

Wordnet Example (Contd.)

Unique beginner synsets
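The hypernym chains shown in the figure can be traced programmatically. A minimal sketch:

    from nltk.corpus import wordnet as wn

    for syn in wn.synsets('fake', pos=wn.NOUN):
        # Each path runs from a unique beginner (e.g., entity) down to the sense
        for path in syn.hypernym_paths():
            print(' -> '.join(s.name() for s in path))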

Topic Models

Latent variable models are useful in discovering hidden structures in text data

Latent Semantic Indexing using singular value decomposition (SVD) (Deerwester et al. 1990)

Probabilistic Latent Semantic Indexing (pLSI) (Hofmann 1999)

Latent Dirichlet allocation (LDA) (Blei et al. 2003)

Topic Models (Contd.)

LDA addresses the shortcomings of its predecessors

SVD may produce negative factor loadings, which makes the results hard to interpret

pLSI (aspect model): the number of parameters grows linearly with the number of documents, which leads to model overfitting

LDA outperforms pLSI in terms of test-set probability (perplexity)
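Perplexity is the standard held-out measure for this comparison (Blei et al. 2003); lower is better:

    \text{perplexity}(D_{\text{test}}) = \exp\!\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)

where M is the number of test documents and N_d the number of tokens in document d.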

LDA Generative Process

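For reference, the standard (smoothed) LDA generative process (Blei et al. 2003):

    For each topic k = 1, ..., K: draw \phi_k \sim \text{Dirichlet}(\beta)
    For each document d: draw \theta_d \sim \text{Dirichlet}(\alpha)
    For each token position n in document d:
        draw a topic z_{dn} \sim \text{Multinomial}(\theta_d)
        draw a word w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}})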

LDA Inference Problem

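The inference problem is to compute the posterior over the latent variables given a document:

    p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

The normalizing constant in the denominator is what makes exact inference intractable.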

LDA Model
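For a single document, the LDA joint distribution factorizes as

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)

with \theta drawn once per document and a topic assignment z_n for every token.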

LDA: Intractable Inference

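Marginalizing \mathbf{z} and \theta gives the document likelihood:

    p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{k=1}^{K} p(z_n = k \mid \theta) \, p(w_n \mid z_n = k, \beta) \, d\theta

The coupling between \theta and \beta inside the product over tokens makes this integral intractable, which motivates approximate inference.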

Model Estimation Methods

Method                    Model              Latent z            Latent θ                                Other parameters
Collapsed Gibbs sampling  LDA                Sample              Integrate out                           Integrate out
Stochastic EM             TOT                Sample              Integrate out                           Integrate out; maximize w.r.t. other parameters
Variational Bayes         LDA and DTM        Assume independent  Assume independent (retain sequential   Maximize
                                                                 structure in DTM)
Augmented Gibbs sampling  WNTM (this study)  Sample              Sample                                  Integrate out; sample other parameters

Collapsed Gibbs Sampling

With \theta and \phi integrated out, the sampler iterates over tokens i and resamples each topic assignment from the conditional

    z_i \sim P(z_i \mid \mathbf{z}_{-i}, \mathbf{w})

Marginalize
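Both multinomial parameters can be integrated out analytically via Dirichlet-multinomial conjugacy. With n_{dk} the number of tokens in document d assigned to topic k, n_{kw} the number of times word w is assigned to topic k, and B(\cdot) the multivariate beta function:

    P(\mathbf{z} \mid \alpha) = \prod_{d=1}^{M} \frac{B(\mathbf{n}_{d} + \alpha)}{B(\alpha)}, \qquad P(\mathbf{w} \mid \mathbf{z}, \beta) = \prod_{k=1}^{K} \frac{B(\mathbf{n}_{k} + \beta)}{B(\beta)}

where \mathbf{n}_{d} = (n_{d1}, ..., n_{dK}) and \mathbf{n}_{k} = (n_{k1}, ..., n_{kV}).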

Joint Probability

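Combining the two marginals gives the collapsed joint distribution the sampler works with:

    P(\mathbf{w}, \mathbf{z} \mid \alpha, \beta) = P(\mathbf{w} \mid \mathbf{z}, \beta) \, P(\mathbf{z} \mid \alpha)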

Posterior Probability
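Dividing the collapsed joint by the same quantity with token i removed yields the standard collapsed Gibbs conditional (the superscript -i denotes counts computed with token i excluded; d_i and w_i are the document and word of token i, and V is the vocabulary size):

    P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \left( n_{d_i k}^{-i} + \alpha \right) \frac{n_{k w_i}^{-i} + \beta}{n_{k \cdot}^{-i} + V\beta}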

Limitations of the LDA Model

Additional meta-data information cannot be included in the model

Partially addressed problem:

The author-topic (AT) model (Rosen-Zvi et al. 2010) and Dirichlet-multinomial regression (DMR) (Mimno and McCallum, 2008)

The AT model delivers worse performance than the LDA model, except when testing articles are very short

The AT model is not a general framework for including arbitrary document-level meta-data in the model

LDA with Dirichlet Forest Prior

The Dirichlet Forest prior can be used to incorporate prior knowledge

Mixture of Dirichlet tree distributions

Two basic types of knowledge:

Must-Links: two words have similar probability within any topic

Cannot-Links: two words should not both have large probability within any topic

Andrzejewski, Zhu, and Craven, ICML 2009

LDA with Dirichlet Forest Prior (Contd.)

Additional types of knowledge:

Split: separate two or more sets of words from a single topic into different topics by placing must-links within the sets and cannot-links between them

Merge: combine two or more sets of words using must-links

Isolate: place must-links within the common set, and cannot-links between the common set and the other high-probability words from all topics

Dirichlet Tree Distribution For Must-Link

A Dirichlet tree distribution is a composition of Dirichlet distributions

[Figure (a): A, B, and C are the vocabulary; sampling starts from the root of the tree]
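As a sketch of the standard construction: at each internal node s of the tree, draw branch probabilities \delta_s from a Dirichlet over s's children; the probability of a word (leaf) w is the product of the branch probabilities along the root-to-w path:

    \delta_s \sim \text{Dirichlet}(\gamma_s), \qquad \phi_w = \prod_{(s \rightarrow c) \,\in\, \text{path}(\text{root},\, w)} \delta_{s,c}

A depth-one tree reduces to an ordinary Dirichlet; a must-link places the linked words under a shared internal node with a large edge weight, so their within-topic probabilities rise and fall together.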
