Chapter 14: TEXT MINING
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
Presented by: Yulong Zhang, Nov. 16, 2011


Page 1

Chapter 14: TEXT MINING

Cios / Pedrycz / Swiniarski / Kurgan

Presented by: Yulong Zhang, Nov. 16, 2011

Page 2

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 3

What we have learned – structured data


Feature extraction
Feature selection
Discretization
Clustering
Classification
…

Page 4

But what should we do for these data?


Page 5

Introduction

So far we have focused on data mining methods for the analysis and extraction of useful knowledge from structured data, such as flat files and relational or transactional databases.

In contrast, text mining is concerned with the analysis of text databases that consist mainly of semi-structured or unstructured data, such as:

– collections of articles, research papers, e-mails, blogs, discussion forums, and WWW pages

Page 6

Introduction

Semi-structured data is neither completely structured (e.g., relational table) nor completely unstructured (e.g., free text)

– a semi-structured document has:

• some structured fields such as title, list of authors, keywords, publication date, category, etc.

• and some unstructured fields such as abstract and contents

Page 7

Important differences between the two:

• The number of features

• Features of semi-structured data are sparse

• Rapid growth in size (only a tiny portion is relevant, and even less is useful)


Structured data: ~10^0 – ~10^2 features
Semi-structured data: ~10^3 features

Page 8

Introduction

There are three main types of retrieval within the knowledge discovery process framework:

– data retrieval - retrieval of structured data from DBMS and data warehouses (Chapter 6)

– information retrieval - concerns organization and retrieval of information from large collections of semi-structured or unstructured text-based databases and the web

– knowledge retrieval - generation of knowledge from (usually) structured data (Chapters 9-13)

SELECT * FROM xx WHERE yy = zz

IF xx THEN yy ELSE zz

Page 9

Information Retrieval

...[the] actions, methods and procedures for recovering stored data to provide information on a given subject

ISO 2382/1 (1984)

Page 10

Information Retrieval: Definitions

– database
  • a collection of documents
– document
  • a sequence of terms in a natural language that expresses ideas about some topic
– term
  • a semantic unit: a word, phrase, or root of a word
– query
  • a request for documents that cover a particular topic of interest to the user of an IR system

Page 11

[Figure: toy example — the term "Obama", a document containing it, a database of such documents, and the query "Next president"]

Page 12

Information Retrieval: Definitions

– IR system
  • its goal is to find relevant documents in response to the user's request
  • it performs matching between the language of the query and the language of the document
    – simple word matching does not work, since the same word can have many different semantic meanings
      » e.g., MAKE: "to make a mistake", "make of a car", "to make up excuses", "to make for an exit", "it is just a make-believe"
    – also, one word may have many morphological variants: make, makes, made, making

Page 13

Information Retrieval: Definitions

• other problems
  – consider the query "Abraham Lincoln": should it return a document that contains the sentences "Abraham owns a Lincoln. It is a great car."?
  – consider the query "what is the funniest movie ever made": how can the IR system know what the user's idea of a funny movie is?
• difficulties include inherent properties of natural language, high expectations of the user, etc.
  – a number of mechanisms were developed to cope with them

No wait! Data mining is an interesting course! Wait! Is data mining an interesting course? No!

Page 14

Information Retrieval

IR cannot provide one single "best" answer to the user
– many algorithms provide one "correct" answer, such as
  SELECT Price FROM Sales WHERE Item = "book"
  or: find the shortest path from node A to node B
– IR, on the other hand, provides a range of possibly best answers and lets the user choose
• the query "Lincoln" may return information about
  – Abraham Lincoln

– Lincoln dealership

– The Lincoln Memorial

– The University of Nebraska-Lincoln

– The Lincoln University in New Zealand

IR systems do not give just one right answer but perform an approximate search that returns multiple, potentially correct answers.


Page 15

Information Retrieval

IR system provides information based on the stored data

– the key is to provide some measurement of relevance between the stored data and the user’s query

• i.e., the relation between requested information and retrieved information

• given a query, the IR system has to check whether the stored information is relevant to the query

Page 16

Information Retrieval

IR system provides information based on the stored data

– IR systems use heuristics to find relevant information
  • they find a "close" answer and use a heuristic to measure its closeness to the "right" answer
  • the inherent difficulty is that very often we do not know what the right answer is!
    – we just measure how close to the right answer we can come
– the solution involves using measures of precision and recall
  • they are used to measure the "accuracy" of IR systems

Will be discussed later

Page 17

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 18

Architecture of IR Systems

[Figure: architecture of an IR system — documents flow from source documents to textual data to tagged data to an inverted index (the search database); the user's query is transformed, matched against the index with a similarity measure, and a ranked list of documents is returned]

source documents: web pages such as "Welcome to Lincoln Center for Perfor…", "Welcome to the Univers. of Nebraska", "March 31, 1849 Lincoln returns …"

textual data:
  D1: Lincoln Park Zoo is everyone's zoo …
  D2: This website includes a biography, photographs, and lots …
  D3: Biography of Abraham Lincoln, the sixteenth President …

tagged data:
  D1: <DOC> <DOCNO>1</DOCNO> <TEXT>Lincoln Park Zoo is every …</TEXT> </DOC>
  D2: <DOC> <DOCNO>2</DOCNO> <TEXT>This website includes a biography, photographs, …</TEXT> </DOC> …

inverted index (database):
  lincoln: D1, D2, D13, D54, …
  zoo: D2, D43, D198, …
  website: D1, D2, D3, D4, …
  university: D4, D8, D14, …

user query: "where is the University of Nebraska Lincoln?"
transformed query: where, university, nebraska, lincoln

similarity measure:
  similarity (D1, query) = 0.15
  similarity (D2, query) = 0.10
  similarity (D3, query) = 0.14
  similarity (D4, query) = 0.75 …

list of ranked documents: document D4, document D52, document D12, document D134, …

Page 19

Architecture of IR Systems

– Search database
  • is organized as an inverted index file of the significant character strings that occur in the stored tagged data
    – the inverted file specifies where the strings occur in the text
    – the strings include words remaining after excluding determiners, conjunctions, and prepositions, known as STOP WORDS
      » a determiner is a non-lexical element preceding a noun in a noun phrase, e.g., the, that, two, a, many, all
      » a conjunction is used to combine multiple sentences, e.g., and, or
      » a preposition links nouns, pronouns, and phrases to other words in a sentence, e.g., on, beneath, over, of, during, beside
    – the stored words use a common form
      » they are stemmed to obtain the ROOT FORM by removing common prefixes and suffixes
      » synonyms of a given word are also found and used

Disease, diseased, diseases, illness, unwellness, malady, sickness, …
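To make the inverted-index idea concrete, here is a minimal Python sketch; the toy corpus, the tiny stop-word list, and the crude suffix-stripping "stemmer" are illustrative assumptions, not the book's implementation.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "or", "this", "for"}  # tiny illustrative list

def terms(text):
    """Tokenize, drop stop words, and crudely 'stem' by stripping trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w.rstrip("s") if len(w) > 3 else w for w in words if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index

docs = {  # toy documents, loosely following the slide's example
    "D1": "Lincoln Park Zoo is everyone's zoo",
    "D2": "This website includes a biography and photographs",
    "D3": "Biography of Abraham Lincoln, the sixteenth President",
}
index = build_inverted_index(docs)
print(sorted(index["lincoln"]))    # ['D1', 'D3']
print(sorted(index["biography"]))  # ['D2', 'D3']
```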

Page 20

Architecture of IR Systems

– Query
  • is composed of character strings combined by Boolean operators and additional features such as contextual or positional operators
    – the query is also preprocessed in terms of removing determiners, conjunctions, and prepositions

– No linguistic analysis of the semantics of the stored texts, or of the queries, is performed
  • thus the IR systems are domain-independent

Page 21

How does it fit in the CS taxonomy?

[Figure: a taxonomy — Computers splits into Artificial Intelligence, Algorithms, Databases, and Networking; Artificial Intelligence splits into Robotics, Search, and Natural Language Processing; Natural Language Processing splits into Information Retrieval, Machine Translation, and Language Analysis; Language Analysis splits into Semantics and Parsing]

By Rada Mihalcea, “Natural Language Processing”

Page 22

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 23

Linguistic Preprocessing

Creation of the inverted index requires linguistic preprocessing, which aims at extracting important terms from a document represented as a bag of words.

Term extraction involves two main operations:
– Removal of stop words
– Stemming

Page 24

Linguistic Preprocessing

• Removal of stop words
  – stop words are defined as terms that are irrelevant although they occur frequently in the documents:
    • a determiner is a non-lexical element preceding a noun in a noun phrase, and includes articles (a, an, the), demonstratives used with noun phrases (this, that, these, those), possessive determiners (her, his, its, my, our, their, your), and quantifiers (all, few, many, several, some, every)
    • a conjunction is a part of speech that is used to combine two words, phrases, or clauses together, and includes coordinating conjunctions (for, and, nor, but, or, yet, so), correlative conjunctions (both … and, either … or, not (only) … but (… also)), and subordinating conjunctions (after, although, if, unless, because)

Page 25

Linguistic Preprocessing

• Removal of stop words (continued)
  – stop words are defined as terms that are irrelevant although they may occur frequently in the documents:
    • a preposition links nouns, pronouns, and phrases to other words in a sentence (on, beneath, over, of, during, beside, etc.)

• Finally, the stop words include some custom-defined words, which are related to the subject of the database

e.g., for a database that lists all research papers related to brain modeling, the words brain and model should be removed

Page 26

Linguistic Preprocessing

• Stemming
  – words that appear in documents often have many morphological variants
  – each word that is not a stop word is reduced to its corresponding stem word (term)
    • words are stemmed to obtain the root form by removing common prefixes and suffixes
    • in this way, we can identify groups of corresponding words where the words in the group are syntactical variants of each other, and collect only one word per group
      – for instance, the words disease, diseases, and diseased share the common stem disease, and can be treated as different occurrences of this term

Page 27

• For English, stemming is not a big problem — publicly available algorithms give good results. The most widely used is the Porter stemmer, available at http://www.tartarus.org/~martin/PorterStemmer/ (see the sketch below).

E.g., in Slovenian, 10–20 different forms correspond to the same word ("to laugh"): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete, smejeva, smejes, smejemo, smejis, smeje, smejoc, smejta, smejte, smejva

In Chinese… it all goes without saying (Chinese words are not inflected; the harder preprocessing problem is segmenting text into words in the first place).
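For a quick experiment, the Porter stemmer is available in NLTK (assuming the nltk package is installed); the word list below is just an illustration.

```python
from nltk.stem.porter import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["disease", "diseases", "diseased", "make", "makes", "making", "made"]:
    print(word, "->", stemmer.stem(word))
# The three 'disease' variants map to the same stem; irregular forms such as
# 'made' are NOT unified -- Porter stemming is purely suffix-based.
```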

Page 28

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 29

Measures of Text Retrieval

Let us suppose that an IR system returned a set of documents to the user’s query.

We define measures that allow us to evaluate how accurate (correct) the system's answer was.

Two types of documents can be found in a database:
– relevant documents, which are relevant to the user's query
– retrieved documents, which are returned to the user by the system

Page 30

• Precision
  – evaluates the ability to retrieve documents that are mostly relevant
• Recall
  – evaluates the ability of the search to find all of the relevant items in the corpus

  precision = |X| / (total number of retrieved documents)
  recall = |X| / (total number of relevant documents)

  where X is the set of documents that are both retrieved and relevant

Precision and Recall

[Figure: Venn diagram — within the entire collection of documents, the set of relevant documents and the set of retrieved documents overlap; their intersection X contains the documents that are both retrieved and relevant]

              retrieved                      not retrieved
relevant      X = retrieved and relevant     not retrieved and relevant
irrelevant    retrieved and irrelevant       not retrieved and irrelevant
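As a small sketch, both measures can be computed directly from the sets of retrieved and relevant document ids (the ids below are made up):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall; X is the set of retrieved AND relevant documents."""
    x = retrieved & relevant
    precision = len(x) / len(retrieved) if retrieved else 0.0
    recall = len(x) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}   # returned by the system (hypothetical ids)
relevant  = {2, 4, 6, 8}      # judged relevant for the query
print(precision_recall(retrieved, relevant))  # (0.4, 0.5): 2 of 5 retrieved, 2 of 4 relevant
```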

Page 31

Precision and Recall

• Trade-off between precision and recall

[Figure: the precision (y-axis) vs. recall (x-axis) trade-off curve, both ranging from 0 to 1. The ideal case is the top-right corner (precision = recall = 1). Operating at high precision returns relevant documents but misses many of them; operating at high recall returns most relevant documents but includes lots of unwanted documents.]

Page 32

Computing Recall

• The number of relevant documents is often not available, and thus we use techniques to estimate it, such as
  – sampling across the database and performing relevance judgments on the documents
  – applying different retrieval algorithms to the same database for the same query
    • the relevant documents are the aggregate of all found documents
  – the generated list is a gold standard against which recall is computed

Page 33

Computing Recall and Precision

• For a given query
  – generate the ranked list of retrieved documents
  – adjust a threshold on the ranked list to generate different sets of retrieved documents, and thus different recall/precision measures
    • mark each document in the ranked list that is relevant according to the gold standard
    • compute recall and precision for each position in the ranked list that contains a relevant document
(The ranking below is still missing one relevant document, and thus will never reach 100% recall.)

rank  doc #  relevant?  precision / recall
1     134    yes        P = 1/1 = 1.00, R = 1/7 = 0.14
2     1987   yes        P = 2/2 = 1.00, R = 2/7 = 0.29
3     21
4     8712   yes        P = 3/4 = 0.75, R = 3/7 = 0.43
5     112
6     567    yes        P = 4/6 = 0.67, R = 4/7 = 0.57
7     810
8     12     yes        P = 5/8 = 0.63, R = 5/7 = 0.71
9     346
10    478
11    7834   yes        P = 6/11 = 0.55, R = 6/7 = 0.86
12    3412

total # of relevant docs = 7

[Figure: the resulting precision–recall curve]
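The same idea over a ranked list, in a sketch that reproduces the precision/recall points in the table above (doc ids are the slide's hypothetical ones):

```python
def recall_precision_points(ranking, relevant, total_relevant):
    """Precision and recall at every rank that holds a relevant document."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / rank, hits / total_relevant))
    return points

ranking = [134, 1987, 21, 8712, 112, 567, 810, 12, 346, 478, 7834, 3412]
relevant = {134, 1987, 8712, 567, 12, 7834}   # 6 of the 7 relevant docs appear in the ranking
for rank, p, r in recall_precision_points(ranking, relevant, total_relevant=7):
    print(f"rank {rank:2d}: P = {p:.2f}, R = {r:.2f}")
# rank  1: P = 1.00, R = 0.14   ...   rank 11: P = 0.55, R = 0.86
```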

Page 34

Computing Recall/Precision Points: Example 1

n   doc #  relevant
1   588    x
2   589    x
3   576
4   590    x
5   986
6   592    x
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x
14  990

Let total # of relevant docs = 6. Check each new recall point:
rank 1:  R = 1/6 = 0.167; P = 1/1 = 1
rank 2:  R = 2/6 = 0.333; P = 2/2 = 1
rank 4:  R = 3/6 = 0.5;   P = 3/4 = 0.75
rank 6:  R = 4/6 = 0.667; P = 4/6 = 0.667
rank 13: R = 5/6 = 0.833; P = 5/13 = 0.38
One relevant document is missing from the ranking, so it never reaches 100% recall.

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 35

Computing Recall/Precision Points: Example 2

n   doc #  relevant
1   588    x
2   576
3   589    x
4   342
5   590    x
6   717
7   984
8   772    x
9   321    x
10  498
11  113
12  628
13  772
14  592    x

Let total # of relevant docs = 6. Check each new recall point:
rank 1:  R = 1/6 = 0.167; P = 1/1 = 1
rank 3:  R = 2/6 = 0.333; P = 2/3 = 0.667
rank 5:  R = 3/6 = 0.5;   P = 3/5 = 0.6
rank 8:  R = 4/6 = 0.667; P = 4/8 = 0.5
rank 9:  R = 5/6 = 0.833; P = 5/9 = 0.556
rank 14: R = 6/6 = 1.0;   P = 6/14 = 0.429

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 36

Compare Two or More Systems

• The curve closest to the upper right-hand corner of the graph indicates the best performance

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 37

R- Precision

• Precision at the R-th position in the ranking of results for a query that has R relevant documents.

n   doc #  relevant
1   588    x
2   589    x
3   576
4   590    x
5   986
6   592    x
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x
14  990

R = # of relevant docs = 6
R-Precision = precision at rank R = 6: 4/6 = 0.67

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 38

F-Measure

• One measure of performance that takes into account both recall and precision.

• Harmonic mean of recall and precision:

• Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.

  F = 2PR / (P + R) = 2 / (1/P + 1/R)

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 39

E Measure (parameterized F Measure)

• A variant of F measure that allows weighting emphasis on precision over recall:

• The value of β controls the trade-off:
  – β = 1: precision and recall are weighted equally (E = F)
  – β > 1: recall is weighted more
  – β < 1: precision is weighted more

  E = (β² + 1)PR / (β²P + R) = (β² + 1) / (β²/R + 1/P)

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
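A small sketch of both measures as defined above; the precision/recall values are chosen arbitrarily.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def e_measure(p, r, beta=1.0):
    """Parameterized variant: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r) if (b2 * p + r) else 0.0

p, r = 0.75, 0.43
print(round(f_measure(p, r), 3))          # 0.547
print(round(e_measure(p, r, beta=1), 3))  # 0.547 -- equals F when beta = 1
print(round(e_measure(p, r, beta=2), 3))  # 0.47  -- recall weighted more
```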

Page 40

Mean Average Precision (MAP)

• Average Precision: the average of the precision values at the points at which each relevant document is retrieved
  – Ex. 1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
  – Ex. 2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: the average of the average precision values over a set of queries

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
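A sketch that recomputes the two average-precision values above from the relevance marks of Examples 1 and 2 (it uses exact arithmetic, so the first value differs from the slide's in the third decimal only because the slide rounds its intermediate precisions):

```python
def average_precision(relevance_flags, total_relevant):
    """Mean of the precision values at the ranks holding relevant documents;
    relevant documents never retrieved contribute 0 to the average."""
    hits, precisions = 0, []
    for rank, is_rel in enumerate(relevance_flags, start=1):
        if is_rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

# Relevance flags (the x-marks) for the two example rankings from the slides.
ex1 = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
ex2 = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1]
print(round(average_precision(ex1, total_relevant=6), 3))  # 0.634 (slide: 0.633, rounded)
print(round(average_precision(ex2, total_relevant=6), 3))  # 0.625
# MAP is then just the mean of these per-query values over a set of queries.
```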

Page 41

Non-Binary Relevance

• Documents are rarely entirely relevant or non-relevant to a query

• Many sources of graded relevance judgments
  – relevance judgments on a 5-point scale
  – multiple judges
  – click distribution and deviation from expected levels (but click-through != relevance judgments)

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)

Page 42

Cumulative Gain

• With graded relevance judgments, we can compute the gain at each rank.

• Cumulative Gain at rank n:

  CGn = rel1 + rel2 + … + reln

  (where reli is the graded relevance of the document at position i)

Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
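A one-line sketch with made-up graded relevance judgments:

```python
def cumulative_gain(graded_relevances, n):
    """CG_n: sum of the graded relevance scores of the top-n ranked documents."""
    return sum(graded_relevances[:n])

rels = [3, 2, 3, 0, 1, 2]          # hypothetical 0-3 relevance grades, in rank order
print(cumulative_gain(rels, 4))    # 3 + 2 + 3 + 0 = 8
```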

Page 43

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 44

How to Measure Text Similarity?

It is a well-studied problem
– metrics use a "bag of words" model
  • it completely ignores word order and syntactic structure
  • it treats both document and query as a bag of independent words
    – common "stop words" are removed
    – words are stemmed to reduce them to their root form
  • the preprocessed words are called terms
– the vector-space model is used to calculate a similarity measure between documents and a query, and between two documents

Page 45

The Vector-Space Model

Assumptions:
– Vocabulary: the set of all distinct terms that remain after preprocessing the documents in the database; it contains t index terms
  • these "orthogonal" terms form a vector space
– each term i in a document or query j is given a real-valued weight wij
  • documents and queries are expressed as t-dimensional vectors:
    dj = (w1j, w2j, …, wtj)

Page 46

3D Example of the Vector-Space Model

[Figure: documents and query plotted as vectors in the 3-D term space spanned by T1, T2, and T3]

Example:
– document D1 = 2T1 + 6T2 + 5T3
– document D2 = 5T1 + 5T2 + 2T3
– query Q = 0T1 + 0T2 + 2T3
– which document is closer to the query?
  • how do we measure it?
    – Distance?
    – Angle?
    – Projection?

Page 47

The Vector-Space Model

• A collection of n documents
  – is represented in the vector-space model by a term-document matrix
  – a cell in the matrix corresponds to the weight of a term in the document
    • a value of zero means that the term does not exist in the document

• Next we explain how the weights are computed

      T1   T2   …   Tt
D1    w11  w21  …   wt1
D2    w12  w22  …   wt2
:     :    :        :
Dn    w1n  w2n  …   wtn

Page 48

Term Weights

Frequency of a term
– more frequent terms in a document are more important
  • they are more indicative of the topic of the document

  fij = frequency of term i in document j

– the frequency is normalized by dividing by the frequency of the most frequent term in the document:

  tfij = fij / maxi(fij)

Page 49

Term Weights: Inverse Document Frequency

– used to indicate the term's discriminative power
  • terms that appear in many different documents are less indicative of a specific topic

  dfi = document frequency of term i = number of documents containing term i
  idfi = inverse document frequency of term i = log2 (N / dfi)

  where N is the total number of documents in the database, and log2 is used to dampen the effect relative to tfij

Page 50

TF-IDF Weighting

Term frequency–inverse document frequency (tf-idf) weighting:

  wij = tfij × idfi = tfij × log2 (N / dfi)

– the highest weight is assigned to terms that occur frequently in the document but rarely in the rest of the database

• some other ways of determining term weights have also been proposed

– the tf-idf weighting was found to work very well in extensive experiments, and thus it is widely used

Page 51

Example of TF-IDF Weighting

Consider the following document:
  "data cube contains x data dimension, y data dimension, and z data dimension"
where the grey color stands for "ignored" (stop-word) letters, and the other colors indicate distinct terms

– the term frequencies are: data (4), dimension (3), cube (1), contain (1)
– we assume that the entire collection contains 10,000 documents and the document frequencies of these four terms are: data (1300), dimension (250), cube (50), contain (3100)
– then:
  data:      tf = 4/4  idf = log2(10000/1300) = 2.94  tf-idf = 2.94
  dimension: tf = 3/4  idf = log2(10000/250)  = 5.32  tf-idf = 3.99
  cube:      tf = 1/4  idf = log2(10000/50)   = 7.64  tf-idf = 1.91
  contain:   tf = 1/4  idf = log2(10000/3100) = 1.69  tf-idf = 0.42
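A sketch that reproduces the numbers in this example; the collection size and document frequencies are the ones assumed on the slide.

```python
from math import log2

N = 10_000                                    # assumed collection size
tf_raw = {"data": 4, "dimension": 3, "cube": 1, "contain": 1}
df     = {"data": 1300, "dimension": 250, "cube": 50, "contain": 3100}

max_f = max(tf_raw.values())                  # frequency of the most frequent term (4)
for term, f in tf_raw.items():
    tf = f / max_f
    idf = log2(N / df[term])
    print(f"{term:9s} tf = {tf:.2f}  idf = {idf:.2f}  tf-idf = {tf * idf:.2f}")
# data      tf = 1.00  idf = 2.94  tf-idf = 2.94
# dimension tf = 0.75  idf = 5.32  tf-idf = 3.99
# cube      tf = 0.25  idf = 7.64  tf-idf = 1.91
# contain   tf = 0.25  idf = 1.69  tf-idf = 0.42
```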

Page 52

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 53

Text Similarity Measure

A text similarity measure is a function used to compute the degree of similarity between two vectors
– it is usually used to measure the similarity between the query and each of the documents in the database
  • it is used to rank the documents
    – the order indicates their relevance to the query
  • usually a threshold is used to control the number of retrieved relevant documents

Page 54

Cosine Similarity Measure

Defined as the cosine of the angle between two vectors:

  similarity(dj, q) = (dj · q) / (|dj| |q|) = Σ i=1..t (wij × wiq) / ( sqrt(Σ i=1..t wij²) × sqrt(Σ i=1..t wiq²) )

where t is the number of terms.

Example (the same vectors as in the 3-D figure above):
  D1 = 2T1 + 6T2 + 5T3, D2 = 5T1 + 5T2 + 2T3, Q = 0T1 + 0T2 + 2T3

  similarity(D1, Q) = (2·0 + 6·0 + 5·2) / sqrt((4 + 36 + 25)(0 + 0 + 4)) = 10 / sqrt(65 · 4) = 0.62
  similarity(D2, Q) = (5·0 + 5·0 + 2·2) / sqrt((25 + 25 + 4)(0 + 0 + 4)) = 4 / sqrt(54 · 4) = 0.27

Document D1 is twice as similar to the query as document D2.

Page 55

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 56

If we have extra time…I can further talk about:

• Different levels of text mining

(word-> sentence -> segment -> …)

• The application of text mining in other areas

(Spam and malicious executables detection)


Page 57

Outline

• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Levels of Text Mining
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback

Page 58

Improving IR Systems

The basic design of the IR systems described so far can be enhanced to increase precision and recall, and to improve the term-document matrix

– The former issue is often addressed by using a relevance feedback mechanism, while the latter is addressed by using latent semantic indexing

Page 59

Improving IR Systems

Latent Semantic Indexing

– improves the effectiveness of the IR system by retrieving documents that are more relevant to a user's query, through manipulation of the term-document matrix
  • the original term-document matrix is often too large for the available computing resources, is presumed noisy (some anecdotal occurrences of terms should be eliminated), and is too sparse with respect to the "true" term-document matrix
  – therefore, the original matrix is approximated by a smaller, "de-noisified" matrix
    • the new matrix is computed using the singular value decomposition (SVD) technique (Chapter 7)
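A minimal sketch of this idea using NumPy's SVD; the tiny term-document matrix and the choice of k = 2 retained dimensions are illustrative assumptions.

```python
import numpy as np

# Rows = terms, columns = documents (toy tf-idf-like weights, made up).
A = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.8, 1.0, 0.1, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.8, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep only the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # smaller-rank, "de-noisified" approximation
print(np.round(A_k, 2))
# Queries and documents can then be compared in the k-dimensional latent space
# instead of against the original large, sparse matrix.
```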

Page 60

Improving IR Systems

Relevance Feedback
– a modification of the information search process to improve accuracy
– it adds terms to the initial query that may match relevant documents

1. perform an IR search on the initial query
2. get feedback from the user as to which documents are relevant, and find new (with respect to the initial query) terms from the known relevant documents
3. add the new terms to the initial query to form a new query
4. repeat the search using the new query
5. return a set of relevant documents based on the new query
6. the user evaluates the returned documents

[Figure: the relevance feedback loop between the user and the IR system — (1) query, (2) matching documents, (3) user feedback / new terms, (4) new query, (5) documents for the new query, (6) user evaluation]

Page 61

Improving IR Systems

Relevance Feedback

– performed manually
  • the user manually identifies relevant documents, and new terms are selected either manually or automatically
– performed automatically
  • relevant documents are identified automatically by using the top-ranked documents, and new terms are selected automatically (a small sketch of this loop follows below):
    1. identification of the N top-ranked documents
    2. identification of all terms from the N top-ranked documents
    3. selection of the feedback terms
    4. merging the feedback terms with the original query
    5. identifying the top-ranked documents for the modified query
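The chapter does not prescribe a particular term-selection formula, so purely as an illustration the sketch below picks the most frequent new terms of the top-ranked documents and merges them into the query; the function name and the toy data are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_feedback_terms=3):
    """Steps 1-4 of the automatic loop: collect the terms of the top-ranked
    documents, pick the most frequent ones not already in the query, and
    merge them with the original query."""
    counts = Counter(t for doc in top_docs for t in doc if t not in query_terms)
    feedback = [t for t, _ in counts.most_common(n_feedback_terms)]
    return list(query_terms) + feedback

query = ["lincoln", "university"]
top_docs = [                      # term lists of the N top-ranked documents (toy data)
    ["lincoln", "nebraska", "university", "campus"],
    ["nebraska", "lincoln", "campus", "admissions"],
    ["university", "nebraska", "research"],
]
print(expand_query(query, top_docs))
# ['lincoln', 'university', 'nebraska', 'campus', 'admissions']
# The expanded query is then re-run against the index (step 5).
```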

Page 62

Improving IR Systems

Relevance Feedback

• Pros
  – it usually improves average precision by increasing the number of good terms in the query
• Cons
  – it requires more computational work
  – it can decrease effectiveness
    • one bad new word can undo the good done by many good words

Page 63

References

Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley

Feldman, R. and Sanger, J. 2006. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press

Hotho, A., Nürnberger, A. and Paaß, G. 2005. A Brief Survey of Text Mining, GLDV-Journal for Computational Linguistics and Language Technology, 20(1):19-62

Salton, G. and McGill, M. 1982. Introduction to Modern Information Retrieval, McGraw-Hill