DESCRIPTION

Latent Dirichlet Allocation for Topic Modelling.

TRANSCRIPT

Page 1: Minor Project

Context Based Search

By Shatabdi Kundu (2010EET2553)

Computer Technology, M.Tech, IIT Delhi

Email ID: [email protected]

Project Guide: Prof. Santanu Chaudhury

Electrical Engineering Department, IIT Delhi

Email ID: [email protected]

June 22, 2011

Shatabdi Kundu :: 2010EET2553 Prof. Santanu Chaudhury 09 MAY 2011 1 of 16

Page 2: Minor Project

Outline

Introduction to Topic Models: Probabilistic Modelling

Latent Dirichlet Allocation

Topic Discovery using WordNet

Work Done

Results

Conclusion and Future Work

References

Page 3: Minor Project

Probabilistic Modelling

Treat data as observations that arise from a generative probabilistic process that includes hidden variables

For documents, the hidden variables reflect the thematic structure of the collection

Infer the hidden structure using posterior inference

What are the topics that describe this collection?

Situate new data into the estimated model

How does this query or new document fit into the estimated topic structure?

Page 4: Minor Project

Intuition behind LDA

Page 5: Minor Project

Generative Process

Cast these intuitions into a generative probabilistic process

Each document is a random mixture of corpus-wide topics

Each word is drawn from one of those topics

Page 6: Minor Project

Graphical Models

Nodes are random variables

Edges denote possible dependence

Observed variables are shaded

Plates denote replicated structure

Page 7: Minor Project

Graphical Models

The structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.

E.g., this graph corresponds to

$$p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) \tag{1}$$
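As a small numerical check of equation (1), the sketch below evaluates the factorised joint for a toy model. The binary variables and the probability tables `p_y` and `p_x_given_y` are made-up values for illustration, not taken from the slides.

```python
from itertools import product

# Hypothetical toy parameters (assumptions, not from the slides):
# y is binary and every x_n is binary.
p_y = {0: 0.6, 1: 0.4}
p_x_given_y = {
    0: {0: 0.9, 1: 0.1},  # p(x | y = 0)
    1: {0: 0.3, 1: 0.7},  # p(x | y = 1)
}


def joint(y, xs):
    """Return p(y, x_1..x_N) = p(y) * prod_n p(x_n | y)."""
    prob = p_y[y]
    for x in xs:
        prob *= p_x_given_y[y][x]
    return prob


# Summing the joint over every configuration should give 1.
N = 3
total = sum(joint(y, xs) for y in p_y for xs in product([0, 1], repeat=N))

# One specific configuration: 0.4 * 0.7 * 0.7 * 0.3, roughly 0.0588.
example = joint(1, (1, 1, 0))
```

Because each `x_n` depends only on `y`, the joint over N + 1 variables needs just one table for `y` and one shared conditional table, which is the saving that the plate notation expresses.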

Page 8: Minor Project

Latent Dirichlet Allocation

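1. Draw each topic $\beta_k \sim \mathrm{Dir}(\eta)$, for $k \in \{1, \ldots, K\}$
2. For each document:
   1. Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$
   2. For each word:
      1. Draw $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$
      2. Draw $W_{d,n} \sim \mathrm{Mult}(\beta_{Z_{d,n}})$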
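The generative process above can be sketched with the Python standard library alone. The function names, the corpus sizes, and the symmetric hyperparameter values are assumptions made for illustration; words and topics are represented by integer ids.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalising independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i] (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, vocab_size, num_topics,
                    alpha=0.5, eta=0.5, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = random.Random(seed)
    # Step 1: draw each topic beta_k ~ Dir(eta), a distribution over the vocabulary.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    thetas, docs = [], []
    # Step 2: for each document ...
    for _ in range(num_docs):
        # 2.1: draw topic proportions theta_d ~ Dir(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        # 2.2: for each word position ...
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)    # topic assignment Z_{d,n}
            w = sample_categorical(beta[z], rng)  # observed word W_{d,n}
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs


# Hypothetical sizes chosen only to keep the example small.
beta, thetas, docs = generate_corpus(num_docs=3, doc_len=20,
                                     vocab_size=10, num_topics=2)
```

Only `docs` would be observed in practice; `beta` and `thetas` play the role of the hidden variables that inference has to recover.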

Page 9: Minor Project

Latent Dirichlet Allocation

From a collection of documents, infer

Per-word topic assignment $Z_{d,n}$

Per-document topic proportions $\theta_d$

Per-corpus topic distributions $\beta_k$

Use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, etc.
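The slides do not state which inference algorithm was used. As one common option, the sketch below uses collapsed Gibbs sampling (a swapped-in technique, not necessarily the author's method) to estimate the three quantities listed above. The function name `gibbs_lda`, the hyperparameter defaults, and the toy documents are assumptions for illustration.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, alpha=0.1, eta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA over documents of integer word ids."""
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    n_dk = [[0] * K for _ in docs]      # words in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # total words assigned to topic k

    # Random initial topic assignment for every word position.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][n]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional p(z = j | all other assignments, words).
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + eta) / (n_k[j] + V * eta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights, k=1)[0]
                z[d][n] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Point estimates of theta_d and beta_k from the final counts.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    beta = [[(n_kw[k][w] + eta) / (n_k[k] + V * eta) for w in range(V)]
            for k in range(K)]
    return z, theta, beta


# Hypothetical toy corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 0], [4, 5, 3]]
z, theta, beta = gibbs_lda(toy_docs, vocab_size=6, num_topics=2, iters=50)
```

The returned `theta` rows can then be compared across documents (for example with cosine similarity) for retrieval or document-similarity tasks.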

Page 10: Minor Project

Topic Discovery using WordNet

Lexical relations are used to find the latent topics

Synsets (synonym sets) are the basic units

Hyponymy: a semantic relation between word meanings. E.g., {maple} is a hyponym of {tree}

Hypernymy: the inverse of hyponymy. E.g., {tree} is a hypernym of {maple}
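One way these relations can name a group of words is to walk up the hypernym chain of each word and take the first ancestor they all share. The sketch below does this on a tiny hand-written taxonomy. The `HYPERNYM` table is hypothetical and much simpler than WordNet, where a synset can have several hypernyms and a word can have several senses.

```python
# Hypothetical toy taxonomy (an assumption, not WordNet data):
# each entry maps a word to its single direct hypernym.
HYPERNYM = {
    "maple": "tree",
    "oak": "tree",
    "tree": "plant",
    "rose": "flower",
    "flower": "plant",
    "plant": "organism",
    "dog": "animal",
    "animal": "organism",
}


def hypernym_path(word):
    """Return the chain from a word up to the root of the toy taxonomy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(words):
    """Return the most specific ancestor shared by all words, or None."""
    paths = [hypernym_path(w) for w in words]
    # Walk up from the first word; the first shared node is the most specific.
    for candidate in paths[0]:
        if all(candidate in p for p in paths[1:]):
            return candidate
    return None


print(hypernym_path("maple"))                      # ['maple', 'tree', 'plant', 'organism']
print(lowest_common_hypernym(["maple", "oak"]))    # tree
print(lowest_common_hypernym(["maple", "rose"]))   # plant
print(lowest_common_hypernym(["maple", "dog"]))    # organism
```

With a real lexical database the same idea would need a choice of word sense and a rule for handling multiple hypernyms, which this sketch leaves out.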

Page 11: Minor Project

Work Done

I took a collection of 10 documents with a total of around 28K words

I removed stop words and rare words, along with punctuation marks and numbers

I then fitted a 7-topic LDA model to this corpus

This gave 7 topics, each represented by its 5 most probable words

I then used the lexical relations of WordNet to identify the hidden topics, using the common parents of all the words in each topic
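The preprocessing step described above can be sketched as follows. The stop-word list, the `min_count` threshold for rare words, and the sample sentences are assumptions for illustration; the slides do not give the actual lists or thresholds.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(documents, min_count=2):
    """Lowercase, tokenise, and drop stop words and rare words.

    Matching only runs of letters discards punctuation and numbers.
    A word counts as rare if it occurs fewer than min_count times
    in the whole collection.
    """
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    counts = Counter(tok for doc in tokenised for tok in doc)
    return [
        [t for t in doc if t not in STOP_WORDS and counts[t] >= min_count]
        for doc in tokenised
    ]


sample = [
    "The maple tree, planted in 1999, is a tree.",
    "An oak is a tree and the maple is a tree too!",
]
cleaned = preprocess(sample)
print(cleaned)  # [['maple', 'tree', 'tree'], ['tree', 'maple', 'tree']]
```

The cleaned token lists would then be mapped to integer word ids before being passed to a topic model.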

Page 12: Minor Project

Results after training LDA model

This model only selects appropriate words within a topic but does not name the topic

The topic name is discovered using WordNet

Page 13: Minor Project

Results after applying WordNet

The above result gives the hidden topic names for the words that make up the documents.

This kind of model can be used to identify the topic when given only a word.

Page 14: Minor Project

Conclusion and Future Work

Next, we will work on topic-based (context-based) search using this model.

In particular, we will deal with the geo-intent of queries and decide which topic each query belongs to, for better information retrieval.

Page 15: Minor Project

References

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.

Jun Fu Cai, Wee Sun Lee, and Yee Whye Teh. NUS-ML: Improving Word Sense Disambiguation Using Topic Features. SemEval, 2007.

David M. Blei and Jon D. McAuliffe. Supervised Topic Models. NIPS, 2007.

WordNet. http://www.shiffman.net/teaching/a2z/wordnet

Page 16: Minor Project

Thank You
