Information Retrieval Search Engine Technology (5&6) http://tangra.si.umich.edu/clair/ir09 Prof. Dragomir R. Radev [email protected]


Page 1: Information Retrieval Search Engine Technology (5&6)   Prof. Dragomir R. Radev

Information Retrieval / Search Engine Technology (5&6)
http://tangra.si.umich.edu/clair/ir09

Prof. Dragomir R. Radev ([email protected])

Page 2

Final projects

• Two formats:

– A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community.

– A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.

• Deliverables:
 – System (code + documentation + examples) or Paper (+ code, data)
 – Poster (to be presented in class)
 – Web page that describes the project

Page 3

SET/IR – W/S 2009

…9. Text classification
 – Naïve Bayesian classifiers
 – Decision trees…

Page 4

Introduction

• Text classification: assigning documents to predefined categories: topics, languages, users

• A given set of classes C
• Given x, determine its class in C
• Hierarchical vs. flat
• Overlapping (soft) vs. non-overlapping (hard)

Page 5

Introduction

• Ideas: manual classification using rules, e.g.:
 – Columbia AND University → Education
 – Columbia AND "South Carolina" → Geography

• Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression)

• Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)

• Discriminative: model p(y|x) directly.

Page 6

Bayes formula

P(A|B) = P(B|A) P(A) / P(B)

Full probability: P(B) = Σi P(B|Ai) P(Ai)

Page 7

Example (performance enhancing drug)

• Drug (D) with values y/n
• Test (T) with values +/−
• P(D=y) = 0.001
• P(T=+|D=y) = 0.8
• P(T=+|D=n) = 0.01
• Given: an athlete tests positive
• P(D=y|T=+) = P(T=+|D=y) P(D=y) / (P(T=+|D=y) P(D=y) + P(T=+|D=n) P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
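This posterior can be checked in a few lines of Python (a sketch; the variable names are mine, the numbers are from the slide):

```python
# Posterior probability that the athlete used the drug given a positive
# test, via Bayes' rule.
p_d = 0.001          # P(D=y): prior probability of drug use
p_pos_given_d = 0.8  # P(T=+|D=y): test sensitivity
p_pos_given_n = 0.01 # P(T=+|D=n): false-positive rate

# P(T=+) by the full probability formula
p_pos = p_pos_given_d * p_d + p_pos_given_n * (1 - p_d)

# Bayes' rule: P(D=y|T=+)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # → 0.074
```

Despite the positive test, the posterior stays low because the prior P(D=y) is tiny.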

Page 8

Naïve Bayesian classifiers

• Assuming statistical independence

• Features = words (or phrases) typically

P(d ∈ C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d ∈ C) · P(d ∈ C) / P(F1, F2, …, Fk)

Assuming independence of the features:

P(d ∈ C | F1, …, Fk) = [ ∏j=1..k P(Fj | d ∈ C) ] · P(d ∈ C) / [ ∏j=1..k P(Fj) ]

Page 9

Example

• p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
 – p(sneeze|well) = 0.1
 – p(sneeze|cold) = 0.9
 – p(sneeze|allergy) = 0.9
 – p(cough|well) = 0.1
 – p(cough|cold) = 0.8
 – p(cough|allergy) = 0.7
 – p(fever|well) = 0.01
 – p(fever|cold) = 0.7
 – p(fever|allergy) = 0.4

Example from Ray Mooney

Page 10

Example (cont’d)

• Features: sneeze, cough, no fever
• P(well|e) = (.9)(.1)(.1)(.99) / P(e) = 0.0089/P(e)
• P(cold|e) = (.05)(.9)(.8)(.3) / P(e) = 0.01/P(e)
• P(allergy|e) = (.05)(.9)(.7)(.6) / P(e) = 0.019/P(e)
• P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
• P(well|e) = .23
• P(cold|e) = .26
• P(allergy|e) = .50

Example from Ray Mooney
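A minimal implementation of this example (a sketch; the dictionary layout is mine, the probabilities are from the slides):

```python
# Naïve Bayes on the sneeze/cough/fever example.
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom present | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}

evidence = {"sneeze": True, "cough": True, "fever": False}

# Unnormalized scores: P(class) * product of P(feature | class)
scores = {}
for c in priors:
    s = priors[c]
    for f, present in evidence.items():
        s *= cond[c][f] if present else 1 - cond[c][f]
    scores[c] = s

p_e = sum(scores.values())  # P(e), the normalizer
posterior = {c: s / p_e for c, s in scores.items()}
print(max(posterior, key=posterior.get))  # → allergy
```

For long documents the product of many small probabilities underflows; a real implementation sums log probabilities instead.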

Page 11

Issues with NB

• Where do we get the values P(d ∈ C)? – use maximum likelihood estimation (Ni/N)
• Same for the conditionals – these are based on a multinomial generator, and the MLE estimate is Tji / Σi Tji
• Smoothing is needed – why? (an unseen feature would otherwise zero out the whole product)
• Laplace smoothing: (Tji + 1) / Σi (Tji + 1)
• Implementation: how to avoid floating point underflow? (work with sums of log probabilities)

Page 12

Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR

FUNDS FOR INVESTMENTS

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

Page 13

SpamAssassin

• http://spamassassin.apache.org/
• http://spamassassin.apache.org/tests_3_1_x.html

Page 14

Feature selection: the χ² test

• For a term t:
• C = class, It = term-presence indicator
• Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
 – P(C=0) = (k00 + k01)/n
 – P(C=1) = 1 − P(C=0) = (k10 + k11)/n
 – P(It=0) = (k00 + k10)/n
 – P(It=1) = 1 − P(It=0) = (k01 + k11)/n

         It=0   It=1
 C=0     k00    k01
 C=1     k10    k11

Page 15

Feature selection: the χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them.

χ² = n (k00 k11 − k01 k10)² / ((k00 + k01)(k10 + k11)(k00 + k10)(k01 + k11))
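The χ² statistic can be computed directly from the contingency counts (a sketch; the example counts are mine):

```python
# χ² score for one term from its 2x2 contingency table, following the
# k_ij layout on the slide.
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k00 * k11 - k01 * k10) ** 2
    den = (k00 + k01) * (k10 + k11) * (k00 + k10) * (k01 + k11)
    return num / den

# Independent counts give χ² = 0; class-correlated counts score high.
print(chi_square(10, 10, 10, 10))  # → 0.0
print(chi_square(90, 10, 10, 90))  # → 128.0
```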

Page 16

Feature selection: mutual information

• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model
• Measures amount of information: if the distribution is the same as the background distribution, then MI = 0
• X = word; Y = class

MI(X, Y) = Σx Σy P(x, y) log [ P(x, y) / (P(x) P(y)) ]
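A small sketch of the MI computation (the joint-distribution tables are mine):

```python
import math

# Mutual information between a word indicator X and a class label Y,
# from a table of joint probabilities.
def mutual_information(joint):
    # joint[(x, y)] = P(X=x, Y=y); marginals are summed out below
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            mi += p * math.log(p / (px[x] * py[y]))
    return mi

# If X and Y are independent, MI = 0 (the background-distribution case).
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(independent))  # → 0.0
```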

Page 17

Well-known datasets

• 20 newsgroups
 – http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
 – http://www.daviddlewis.com/resources/testcollections/reuters21578/
 – Categories: grain, acquisitions, corn, crude, wheat, trade…
• WebKB
 – http://www-2.cs.cmu.edu/~webkb/
 – Classes: course, student, faculty, staff, project, dept, other
 – NB performance (2000):
   P = 26, 43, 18, 6, 13, 2, 94
   R = 83, 75, 77, 9, 73, 100, 35

Page 18

Evaluation of text classification

• Microaveraging – pools the per-document decisions for all classes into one contingency table
• Macroaveraging – averages the per-class measures

Page 19

Vector space classification

[Figure: documents from topic1 and topic2 plotted in a two-dimensional (x1, x2) vector space]

Page 20

Decision surfaces

[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) space]

Page 21

Decision trees

[Figure: axis-parallel decision-tree boundaries separating topic1 from topic2 in the (x1, x2) space]

Page 22

Classification using decision trees

• Expected information need:

 I(s1, s2, …, sm) = − Σi pi log2(pi)

• s = data samples
• m = number of classes

Page 23

RID | Age    | Income | Student | Credit    | Buys?
  1 | <= 30  | High   | No      | Fair      | No
  2 | <= 30  | High   | No      | Excellent | No
  3 | 31..40 | High   | No      | Fair      | Yes
  4 | > 40   | Medium | No      | Fair      | Yes
  5 | > 40   | Low    | Yes     | Fair      | Yes
  6 | > 40   | Low    | Yes     | Excellent | No
  7 | 31..40 | Low    | Yes     | Excellent | Yes
  8 | <= 30  | Medium | No      | Fair      | No
  9 | <= 30  | Low    | Yes     | Fair      | Yes
 10 | > 40   | Medium | Yes     | Fair      | Yes
 11 | <= 30  | Medium | Yes     | Excellent | Yes
 12 | 31..40 | Medium | No      | Excellent | Yes
 13 | 31..40 | High   | Yes     | Fair      | Yes
 14 | > 40   | Medium | No      | Excellent | No

Page 24

Decision tree induction

• I(s1, s2) = I(9, 5) = − 9/14 log2(9/14) − 5/14 log2(5/14) = 0.940

Page 25

Entropy and information gain

• E(A) = Σj [ (s1j + … + smj) / s ] · I(s1j, …, smj)

• Entropy = expected information based on the partitioning into subsets by A

• Gain(A) = I(s1, s2, …, sm) − E(A)

Page 26

Entropy

• Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
• Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971

Page 27

Entropy (cont’d)

• E(age) = 5/14 I(s11, s21) + 4/14 I(s12, s22) + 5/14 I(s13, s23) = 0.694

• Gain (age) = I (s1,s2) – E(age) = 0.246

• Gain (income) = 0.029, Gain (student) = 0.151, Gain (credit) = 0.048
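The gain computation on these slides can be reproduced in a few lines (a sketch; only the class counts from the slides are used):

```python
import math

# Information gain of the "age" attribute on the 14-row buys-computer table.
def info(*counts):
    """Expected information I(s1, ..., sm), in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Whole table: 9 "yes", 5 "no"
i_all = info(9, 5)                    # ≈ 0.940

# Partition by age: (yes, no) counts per branch: <=30, 31..40, >40
branches = [(2, 3), (4, 0), (3, 2)]
n = sum(a + b for a, b in branches)
e_age = sum((a + b) / n * info(a, b) for a, b in branches)

gain_age = i_all - e_age
print(round(gain_age, 3))  # → 0.247 (the slides round intermediates and get 0.246)
```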

Page 28

Final decision tree

age?
├── <= 30  → student?
│            ├── no  → no
│            └── yes → yes
├── 31..40 → yes
└── > 40   → credit?
             ├── excellent → no
             └── fair      → yes

Page 29

Other techniques

• Bayesian classifiers
• X: age <= 30, income = medium, student = yes, credit = fair
• P(yes) = 9/14 = 0.643
• P(no) = 5/14 = 0.357

Page 30

Example

• P(age <= 30 | yes) = 2/9 = 0.222
• P(age <= 30 | no) = 3/5 = 0.600
• P(income = medium | yes) = 4/9 = 0.444
• P(income = medium | no) = 2/5 = 0.400
• P(student = yes | yes) = 6/9 = 0.667
• P(student = yes | no) = 1/5 = 0.200
• P(credit = fair | yes) = 6/9 = 0.667
• P(credit = fair | no) = 2/5 = 0.400

Page 31

Example (cont'd)

• P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P(X | yes) P(yes) = 0.044 × 0.643 = 0.028
• P(X | no) P(no) = 0.019 × 0.357 = 0.007

• Answer: yes/no? (yes, since 0.028 > 0.007)

Page 32

SET/IR – W/S 2009

…10. Linear classifiers
 – Kernel methods
 – Support vector machines…

Page 33

Linear boundary

[Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) space]

Page 34

Vector space classifiers

• Using centroids
• Boundary = line that is equidistant from the two centroids

Page 35

Generative models: kNN

• Assign each element to the closest cluster
• K-nearest neighbors
• Very easy to program
• Tessellation; nonlinearity
• Issues: choosing k, b?
• Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

score(c, dq) = b + Σ{d ∈ kNN(dq)} sc(d) · s(d, dq)

where s(d, dq) is the similarity between d and the query document dq.
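A minimal kNN classifier in the spirit of this slide (a sketch; the toy vectors, the choice of cosine similarity, and k are mine):

```python
from collections import Counter
import math

# k-nearest-neighbor classification by cosine similarity.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_classify(query, labeled, k=3):
    # labeled: list of (vector, class); majority vote among the k most similar
    neighbors = sorted(labeled, key=lambda dv: cosine(query, dv[0]), reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 0.1), "topic1"), ((0.9, 0.2), "topic1"),
         ((0.1, 1.0), "topic2"), ((0.2, 0.8), "topic2"),
         ((0.8, 0.3), "topic1")]
print(knn_classify((1.0, 0.2), train))  # → topic1
```

Similarity-weighted voting (as in the score formula) replaces the `Counter` tally with a sum of similarities per class.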

Page 36

Linear separators

• Two-dimensional line: w1x1 + w2x2 = b is the linear separator
• w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces: wᵀx = b

Page 37

Example 1

[Figure: a separating hyperplane with normal vector w between topic1 and topic2 in the (x1, x2) space]

Page 38

Example 2

• Classifier for "interest" in Reuters-21578
• b = 0
• If the document is "rate discount dlrs world", its score will be 0.67×1 + 0.46×1 + (−0.71)×1 + (−0.35)×1 = 0.07 > 0

Example from MSR

 wi    xi           wi     xi
 0.70  prime       −0.71   dlrs
 0.67  rate        −0.35   world
 0.63  interest    −0.33   sees
 0.60  rates       −0.25   year
 0.46  discount    −0.24   group
 0.43  bundesbank  −0.24   dlr

Page 39

Example: perceptron algorithm

Input: S = ((x1, y1), …, (xn, yn)), yi ∈ {−1, +1}

Algorithm:
 k = 0; w0 = 0
 REPEAT
  FOR i = 1 TO n
   IF yi (wk · xi) ≤ 0 THEN   // mistake: update
    wk+1 = wk + yi xi
    k = k + 1
 UNTIL no mistakes are made

Output: wk
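A runnable version of the perceptron (a sketch; the toy data and the epoch cap are mine):

```python
# Perceptron training on a linearly separable toy set.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron(samples, max_epochs=100):
    # samples: list of (x, y) with y in {-1, +1}; returns the weight vector w
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            if y * dot(w, x) <= 0:                         # misclassified
                w = [wi + y * xi for wi, xi in zip(w, x)]  # w <- w + y x
                mistakes += 1
        if mistakes == 0:   # converged: w separates the data
            break
    return w

# Each x carries a constant 1 so the bias b is learned as the last weight.
data = [((2.0, 1.0, 1.0), +1), ((1.5, 2.0, 1.0), +1),
        ((-1.0, -1.5, 1.0), -1), ((-2.0, -0.5, 1.0), -1)]
w = perceptron(data)
print(all(y * dot(w, x) > 0 for x, y in data))  # → True
```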

Page 40

[Slide from Chris Bishop]

Page 41

Linear classifiers

• What is the major shortcoming of a perceptron?
• How to determine the dimensionality of the separator?
 – Bias-variance tradeoff (example)
• How to deal with multiple classes?
 – Any-of: build a separate classifier for each class
 – One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring

Page 42

Support vector machines

• Introduced by Vapnik in the early 90s.

Page 43

Issues with SVM

• Soft margins (inseparability)
• Kernels (non-linearity)

Page 44

The kernel idea

[Figure: data not linearly separable before the mapping; linearly separable after]

Page 45

Example Φ: R² → R³

(x1, x2) → (z1, z2, z3) = (x1², √2·x1x2, x2²)

(mapping to a higher-dimensional space)

Page 46

The kernel trick

⟨Φ(x), Φ(x′)⟩ = (x1², √2·x1x2, x2²)(x1′², √2·x1′x2′, x2′²)ᵀ = ⟨x, x′⟩² = k(x, x′)

Polynomial kernel: k(x, x′) = (⟨x, x′⟩ + c)ᵈ

Sigmoid kernel: k(x, x′) = tanh(α⟨x, x′⟩ + c)

RBF kernel: k(x, x′) = exp(−‖x − x′‖² / (2σ²))

Many other kernels are useful for IR: e.g., string kernels, subsequence kernels, tree kernels, etc.
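The kernel trick for the R² → R³ mapping above can be verified numerically: the explicit inner product in feature space equals ⟨x, x′⟩² computed in the input space (a sketch; the test vectors are mine):

```python
import math

# Explicit feature mapping phi vs. the degree-2 polynomial kernel (c = 0).
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, xp = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(xp))  # inner product after the mapping
kernel = dot(x, xp) ** 2         # same value, without ever mapping
print(abs(explicit - kernel) < 1e-9)  # → True
```

This is why SVMs never need to construct the high-dimensional vectors: only inner products appear in the optimization.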

Page 47

SVM (Cont’d)

• Evaluation:
 – SVM > kNN > decision tree > NB
• Implementation:
 – Quadratic optimization
 – Use a toolkit (e.g., Thorsten Joachims's SVMlight)

Page 48

Semi-supervised learning

• EM• Co-training• Graph-based

Page 49

Exploiting Hyperlinks – Co-training

• Each document instance has two alternate views (Blum and Mitchell 1998)
 – terms in the document, x1
 – terms in the hyperlinks that point to the document, x2
• Each view is sufficient to determine the class of the instance
 – The labeling function that classifies examples is the same whether applied to x1 or x2
 – x1 and x2 are conditionally independent, given the class

[Slide from Pierre Baldi]

Page 50

Co-training Algorithm

• Labeled data are used to infer two Naïve Bayes classifiers, one for each view
• Each classifier will
 – examine unlabeled data
 – pick the most confidently predicted positive and negative examples
 – add these to the labeled examples
• Classifiers are now retrained on the augmented set of labeled examples

[Slide from Pierre Baldi]

Page 51

Conclusion

• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.

• NB is also good in many circumstances.

Page 52

Readings

• MRS18
• MRS17, MRS19

(MRS = Manning, Raghavan & Schütze, Introduction to Information Retrieval)