© 2007 Cios / Pedrycz / Swiniarski / Kurgan
Chapter 14: TEXT MINING
Presented by: Yulong Zhang, Nov. 16, 2011
Outline
• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback
What we have learned – structured data
Feature extraction, feature selection, discretization, clustering, classification, …
But what should we do with data like these?
Introduction
So far we have focused on data mining methods for analysis and extraction of useful knowledge from structured data, such as flat files and relational or transactional databases.
In contrast, text mining is concerned with analysis of text databases that consist of mainly semi-structured or unstructured data, such as:
– collections of articles, research papers, e-mails, blogs, discussion forums, and WWW pages
Introduction
Semi-structured data is neither completely structured (e.g., a relational table) nor completely unstructured (e.g., free text)
– a semi-structured document has:
  • some structured fields, such as title, list of authors, keywords, publication date, category, etc.
  • and some unstructured fields, such as abstract and contents
Important differences between the two:
• The number of features: structured data typically has on the order of 10^0 to 10^2 features, while semi-structured data has on the order of 10^3
• Features of semi-structured data are sparse
• Semi-structured data is growing quickly in size (only a tiny portion of it is relevant, and even less is useful)
Introduction
There are three main types of retrieval within the knowledge discovery process framework:
– data retrieval: retrieval of structured data from DBMS and data warehouses (Chapter 6), e.g., SELECT * FROM xx WHERE yy = zz
– information retrieval: organization and retrieval of information from large collections of semi-structured or unstructured text-based databases and the web
– knowledge retrieval: generation of knowledge from (usually) structured data (Chapters 9-13), e.g., IF xx THEN yy ELSE zz
Information Retrieval
...[the] actions, methods and procedures for recovering stored data to provide information on a given subject
ISO 2382/1 (1984)
Information Retrieval: Definitions
– database: a collection of documents
– document: a sequence of terms in a natural language that expresses ideas about some topic
– term: a semantic unit; a word, phrase, or root of a word
– query: a request for documents that cover a particular topic of interest to the user of an IR system
[Diagram: a term (e.g., “Obama”) occurs in a document (e.g., one about the next president), which is stored in a database]
Information Retrieval: Definitions
– IR system
  • its goal is to find relevant documents in response to the user's request
  • it performs matching between the language of the query and the language of the document
    – simple word matching does not work, since the same word can have many different semantic meanings, e.g., MAKE:
      “to make a mistake”
      “make of a car”
      “to make up excuses”
      “to make for an exit”
      “it is just a make-believe”
    – also, one word may have many morphological variants: make, makes, made, making
Information Retrieval: Definitions
• other problems
  – consider the query “Abraham Lincoln”: should it return a document that contains the sentences “Abraham owns a Lincoln. It is a great car.”?
  – consider the query “what is the funniest movie ever made”: how can the IR system know what the user's idea of a funny movie is?
• difficulties include inherent properties of natural language, high expectations of the user, etc.
  – a number of mechanisms were developed to cope with them
“Wait! Is data mining an interesting course? No! No wait! Data mining is an interesting course!”
Information Retrieval
An IR system cannot provide one “best” answer to the user
– many algorithms provide one “correct” answer, such as SELECT Price FROM Sales WHERE Item = “book”, or: find the shortest path from node A to node B
– IR, on the other hand, provides a range of possibly best answers and lets the user choose
  • the query “Lincoln” may return information about:
    – Abraham Lincoln
    – a Lincoln dealership
    – the Lincoln Memorial
    – the University of Nebraska-Lincoln
    – the Lincoln University in New Zealand
IR systems do not give just one right answer but perform an approximate search that returns multiple, potentially correct answers.
Information Retrieval
An IR system provides information based on the stored data
– the key is to provide some measurement of relevance between the stored data and the user’s query
• i.e., the relation between requested information and retrieved information
• given a query, the IR system has to check whether the stored information is relevant to the query
Information Retrieval
An IR system provides information based on the stored data
– IR systems use heuristics to find relevant information
  • they find a “close” answer and use heuristics to measure its closeness to the “right” answer
  • the inherent difficulty is that very often we do not know what the right answer is!
    – we just measure how close to the right answer we can come
– the solution involves using measures of precision and recall, which are used to measure the “accuracy” of IR systems (discussed later)
Architecture of IR Systems
[Diagram: architecture of an IR system]
• source documents (e.g., D1: “Lincoln Park Zoo is everyone's zoo …”, D2: “This website includes a biography, photographs, and lots …”, D3: “Biography of Abraham Lincoln, the sixteenth President …”) are gathered as textual data
• the textual data is converted into tagged data, e.g., D1: <DOC> <DOCNO>1</DOCNO> <TEXT>Lincoln Park Zoo is every …</TEXT> </DOC>
• the tagged data is indexed into an inverted index that forms the search database, e.g., lincoln: D1, D2, D13, D54, …; zoo: D2, D43, D198, …; website: D1, D2, D3, D4, …; university: D4, D8, D14, …
• the user's query, “where is the University of Nebraska Lincoln?”, is preprocessed into the transformed query: where, university, nebraska, lincoln
• a similarity measure compares the transformed query with each indexed document, e.g., similarity(D1, query) = 0.15, similarity(D2, query) = 0.10, similarity(D3, query) = 0.14, similarity(D4, query) = 0.75, …
• the system returns a list of ranked documents: D4, D52, D12, D134, …
Architecture of IR Systems
– Search database
  • organized as an inverted index file of the significant character strings that occur in the stored tagged data
    – the inverted file specifies where the strings occur in the text
    – the strings include words after excluding determiners, conjunctions, and prepositions, known as STOP-WORDS
      » a determiner is a non-lexical element preceding a noun in a noun phrase, e.g., the, that, two, a, many, all
      » a conjunction is used to combine multiple sentences, e.g., and, or
      » a preposition links nouns, pronouns, and phrases to other words in a sentence, e.g., on, beneath, over, of, during, beside
    – the stored words use a common form
      » they are stemmed to obtain the ROOT FORM by removing common prefixes and suffixes
      » synonyms of a given word are also found and used, e.g.: disease, diseased, diseases, illness, unwellness, malady, sickness, …
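As an illustration of the indexing step, here is a minimal sketch of inverted-index construction in Python; the stop-word list and documents are hypothetical, and a real system would also stem each word:

    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and", "or", "in"}

    documents = {
        "D1": "Lincoln Park Zoo is everyone's zoo",
        "D3": "Biography of Abraham Lincoln, the sixteenth President",
    }

    def tokenize(text):
        # lowercase, split on non-alphanumeric characters, drop stop words
        words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
        return [w for w in words if w not in STOP_WORDS]

    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            inverted_index[term].add(doc_id)

    print(sorted(inverted_index["lincoln"]))  # ['D1', 'D3']

A production index would also record the positions at which each term occurs, to support the positional query operators mentioned on the next slide.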
Architecture of IR Systems
– Query
  • composed of character strings combined by Boolean operators and additional features, such as contextual or positional operators
    – the query is also preprocessed by removing determiners, conjunctions, and prepositions
– No linguistic analysis of the semantics of the stored texts, or of the queries, is performed
  • thus IR systems are domain-independent
How does it fit in the CS taxonomy?
[Diagram: Computers → Artificial Intelligence, Algorithms, Databases, Networking; Artificial Intelligence → Robotics, Search, Natural Language Processing; Natural Language Processing → Information Retrieval, Machine Translation, Language Analysis; Language Analysis → Semantics, Parsing]
By Rada Mihalcea, “Natural Language Processing”
Linguistic Preprocessing
Creation of the inverted index requires linguistic preprocessing, which aims at extracting important terms from a document represented as a bag of words.
Term extraction involves two main operations:
– Removal of stop words
– Stemming
Linguistic Preprocessing
• Removal of stop words
  – stop words are defined as terms that are irrelevant, although they occur frequently in the documents:
    • a determiner is a non-lexical element preceding a noun in a noun phrase; determiners include articles (a, an, the), demonstratives when used with noun phrases (this, that, these, those), possessive determiners (her, his, its, my, our, their, your), and quantifiers (all, few, many, several, some, every)
    • a conjunction is a part of speech used to combine two words, phrases, or clauses; conjunctions include coordinating conjunctions (for, and, nor, but, or, yet, so), correlative conjunctions (both … and, either … or, not (only) … but (… also)), and subordinating conjunctions (after, although, if, unless, because)
Linguistic Preprocessing
• Removal of stop words
  – stop words are defined as terms that are irrelevant, although they may occur frequently in the documents:
    • a preposition links nouns, pronouns, and phrases to other words in a sentence (on, beneath, over, of, during, beside, etc.)
  • finally, the stop words include some custom-defined words related to the subject of the database
    – e.g., for a database that lists all research papers related to brain modeling, the words brain and model should be removed
Linguistic Preprocessing
• Stemming
  – words that appear in documents often have many morphological variants
  – each word that is not a stop word is reduced to its corresponding stem word (term)
    • words are stemmed to obtain the root form by removing common prefixes and suffixes
    • in this way, we can identify groups of corresponding words where the words in the group are syntactical variants of each other, and collect only one word per group
      – for instance, the words disease, diseases, and diseased share the common stem disease, and can be treated as different occurrences of this term
Linguistic Preprocessing
• For English, stemming is not a big problem: publicly available algorithms give good results. The most widely used is the Porter stemmer, at http://www.tartarus.org/~martin/PorterStemmer/
• Other languages can be harder; e.g., in the Slovenian language, 10-20 different forms correspond to the same word (“to laugh” in Slovenian): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete, smejeva, smejes, smejemo, smejis, smeje, smejoc, smejta, smejte, smejva
• In Chinese… it all goes without saying (there are no inflectional variants to stem, and words are not even separated by spaces)
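As a quick illustration, a minimal sketch using the Porter stemmer implementation shipped with NLTK (assuming the nltk package is installed; any implementation of Porter's algorithm behaves similarly):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    # morphological variants collapse to a common stem (term)
    for word in ["disease", "diseases", "diseased", "make", "makes", "making"]:
        print(word, "->", stemmer.stem(word))
    # the three disease variants map to one stem; the make variants map to "make"

Note that the stem need not itself be a dictionary word; it only has to be a common form shared by the variants.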
Measures of Text Retrieval
Let us suppose that an IR system returned a set of documents in response to the user's query.
We define measures that allow us to evaluate how accurate (correct) the system's answer was.
Two types of documents can be found in a database:
– relevant documents, which are relevant to the user's query
– retrieved documents, which are returned to the user by the system
Precision and Recall
• Precision
  – evaluates the ability to retrieve documents that are mostly relevant:
    precision = |X| / total number of retrieved documents
• Recall
  – evaluates the ability of the search to find all of the relevant items in the corpus:
    recall = |X| / total number of relevant documents
where X = the set of documents that are both retrieved and relevant
[Diagram: within the entire collection of documents, the retrieved set and the relevant set overlap; X is their intersection, and the remaining regions are irrelevant retrieved, irrelevant not retrieved, and relevant not retrieved]
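With hypothetical document IDs, the two measures reduce to set operations; a minimal sketch:

    def precision_recall(retrieved, relevant):
        # X = documents that are both retrieved and relevant
        x = len(set(retrieved) & set(relevant))
        return x / len(retrieved), x / len(relevant)

    retrieved = ["D4", "D52", "D12", "D134"]
    relevant = ["D4", "D12", "D7"]
    print(precision_recall(retrieved, relevant))  # precision = 0.5, recall ≈ 0.667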
Precision and Recall
• Trade-off between precision and recall
[Plot: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1); the ideal case is the upper-right corner. A system that returns relevant documents but misses many of them sits at high precision and low recall; one that returns most relevant documents but includes lots of unwanted documents sits at high recall and low precision]
Computing Recall
• The number of relevant documents is often not available, so we use techniques to estimate it, such as:
  – sampling across the database and performing relevance judgments on the documents
  – applying different retrieval algorithms to the same database for the same query
    • the relevant documents are the aggregate of all found documents
– the generated list is a gold standard against which recall is computed
Computing Recall and Precision
• For a given query:
  – generate the ranked list of retrieved documents
  – adjust a threshold on the ranked list to generate different sets of retrieved documents, and thus different recall/precision measures
    • mark each document in the ranked list that is relevant according to the gold standard
    • compute recall and precision at each position in the ranked list that contains a relevant document

  rank  doc #  relevant?  precision / recall
  1     134    yes        P = 1/1 = 1.00, R = 1/7 = 0.14
  2     1987   yes        P = 2/2 = 1.00, R = 2/7 = 0.29
  3     21
  4     8712   yes        P = 3/4 = 0.75, R = 3/7 = 0.43
  5     112
  6     567    yes        P = 4/6 = 0.67, R = 4/7 = 0.57
  7     810
  8     12     yes        P = 5/8 = 0.63, R = 5/7 = 0.71
  9     346
  10    478
  11    7834   yes        P = 6/11 = 0.55, R = 6/7 = 0.86
  12    3412

(total # of relevant docs = 7; the list is still missing one relevant document, and thus will never reach 100% recall)
[Plot: the resulting precision (y-axis) vs. recall (x-axis) points]
Computing Recall/Precision Points: Example 1

  n   doc #  relevant
  1   588    x
  2   589    x
  3   576
  4   590    x
  5   986
  6   592    x
  7   984
  8   988
  9   578
  10  985
  11  103
  12  591
  13  772    x
  14  990

Let total # of relevant docs = 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/2 = 1
  R = 3/6 = 0.5;   P = 3/4 = 0.75
  R = 4/6 = 0.667; P = 4/6 = 0.667
  R = 5/6 = 0.833; P = 5/13 = 0.38
Missing one relevant document; will never reach 100% recall.
(This and the following evaluation slides are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).)
Computing Recall/Precision Points: Example 2

  n   doc #  relevant
  1   588    x
  2   576
  3   589    x
  4   342
  5   590    x
  6   717
  7   984
  8   772    x
  9   321    x
  10  498
  11  113
  12  628
  13  772
  14  592    x

Let total # of relevant docs = 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/3 = 0.667
  R = 3/6 = 0.5;   P = 3/5 = 0.6
  R = 4/6 = 0.667; P = 4/8 = 0.5
  R = 5/6 = 0.833; P = 5/9 = 0.556
  R = 6/6 = 1.0;   P = 6/14 = 0.429
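These points are mechanical to compute; a minimal sketch that reproduces Example 1, where the relevance flags come straight from the table above:

    def recall_precision_points(relevance_flags, total_relevant):
        # walk down the ranked list; at each relevant document, record recall
        # and precision for the list truncated at that rank
        points, found = [], 0
        for rank, is_relevant in enumerate(relevance_flags, start=1):
            if is_relevant:
                found += 1
                points.append((found / total_relevant, found / rank))
        return points

    # Example 1: relevant documents at ranks 1, 2, 4, 6, and 13
    flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
    for recall, precision in recall_precision_points(flags, total_relevant=6):
        print(f"R = {recall:.3f}; P = {precision:.3f}")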
Compare Two or More Systems
• Plotting the recall/precision points of each system, the curve closest to the upper right-hand corner of the graph indicates the best performance
R-Precision
• Precision at the R-th position in the ranking of results for a query that has R relevant documents.
• For the ranked list of Example 1 (relevant documents at ranks 1, 2, 4, 6, and 13):
  R = # of relevant docs = 6
  R-Precision = 4/6 = 0.67
F-Measure
• One measure of performance that takes both recall and precision into account.
• Harmonic mean of recall and precision:
  F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
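As a quick check with rank 4 of Example 1 (P = 0.75, R = 0.5): F = 2 × 0.75 × 0.5 / (0.75 + 0.5) = 0.6, slightly below the arithmetic mean of 0.625. For a lopsided pair such as P = 1, R = 0.1, the harmonic mean drops to F ≈ 0.18 while the arithmetic mean stays at 0.55, which is why F rewards systems that balance the two.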
E Measure (parameterized F Measure)
• A variant of the F measure that allows weighting emphasis on precision over recall:
  E = ((β^2 + 1) P R) / (β^2 P + R)
• The value of β controls the trade-off:
  – β = 1: equally weight precision and recall (E = F)
  – β > 1: weight recall more
  – β < 1: weight precision more
Mean Average Precision (MAP)
• Average Precision: the average of the precision values at the points at which each relevant document is retrieved, with relevant documents that are never retrieved contributing a precision of 0.
  – Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
  – Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: the average of the average precision values for a set of queries.
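A minimal sketch of average precision, reusing recall_precision_points and flags from the sketch above; the 0 contributed by the never-retrieved relevant document falls out of dividing by total_relevant:

    def average_precision(relevance_flags, total_relevant):
        points = recall_precision_points(relevance_flags, total_relevant)
        return sum(precision for _, precision in points) / total_relevant

    print(round(average_precision(flags, 6), 3))
    # 0.634 for Example 1 (the slide's 0.633 comes from rounding each precision first)

Mean average precision is then just the mean of this value over a set of queries.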
Non-Binary Relevance
• Documents are rarely entirely relevant or non-relevant to a query
• There are many sources of graded relevance judgments:
  – relevance judgments on a 5-point scale
  – multiple judges
  – click distribution and deviation from expected levels (but click-through != relevance judgments)
Cumulative Gain
• With graded relevance judgments, we can compute the gain at each rank.
• Cumulative Gain at rank n:
  CG_n = rel_1 + rel_2 + … + rel_n
  (where rel_i is the graded relevance of the document at position i)
How to Measure Text Similarity?
It is a well-studied problem
– the metrics use a “bag of words” model
  • it completely ignores word order and syntactic structure
  • it treats both document and query as a bag of independent words
    – common “stop words” are removed
    – words are stemmed to reduce them to their root form
  • the preprocessed words are called terms
– the vector-space model is used to calculate a similarity measure between documents and a query, and between two documents
The Vector-Space Model
Assumptions:
– Vocabulary: the set of all distinct terms that remain after preprocessing the documents in the database; it contains t index terms
  • these “orthogonal” terms form a vector space
– each term i in a document or query j is given a real-valued weight wij
  • documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
3D Example of the Vector-Space Model
[Plot: D1, D2, and Q drawn as vectors in the three-dimensional term space spanned by T1, T2, and T3]
Example
– document D1 = 2T1 + 6T2 + 5T3
– document D2 = 5T1 + 5T2 + 2T3
– query Q = 0T1 + 0T2 + 2T3
– which document is closer to the query?
  • how to measure it?
    – Distance?
    – Angle?
    – Projection?
The Vector-Space Model
• A collection of n documents
  – is represented in the vector-space model by a term-document matrix
  – a cell in the matrix corresponds to the weight of a term in the document
    • a value of zero means that the term does not exist in the document
• Next we explain how the weights are computed

        T1   T2   …   Tt
  D1   w11  w21  …  wt1
  D2   w12  w22  …  wt2
   :    :    :        :
  Dn   w1n  w2n  …  wtn
Term Weights
Frequency of a term
– more frequent terms in a document are more important
  • they are more indicative of the topic of the document
  fij = frequency of term i in document j
– the frequency is normalized by dividing by the frequency of the most frequent term in the document:
  tfij = fij / maxi(fij)
Term Weights
Inverse document frequency
– used to indicate the term's discriminative power
  • terms that appear in many different documents are less indicative of a specific topic
  dfi = document frequency of term i = # of documents containing term i
  idfi = inverse document frequency of term i = log2(N / dfi)
  where N is the total # of documents in the database; log2 is used to dampen the effect relative to tfij
TF-IDF Weighting
Term frequency-inverse document frequency (tf-idf) weighting:
  wij = tfij × idfi = tfij × log2(N / dfi)
– the highest weight is assigned to terms that occur frequently in the document but rarely in the rest of the database
  • some other ways of determining term weights have also been proposed
– the tf-idf weighting was found to work very well through extensive experimentation, and thus it is widely used
Example of TF-IDF Weighting
Consider the following document:
“data cube contains x data dimension, y data dimension, and z data dimension”
where grey color stands for “ignored” letters, and the other colors indicate distinct terms
– the frequencies are: data (4), dimension (3), cube (1), contain (1)
– we assume that the entire collection contains 10,000 documents, and that the document frequencies of these four terms are: data (1300), dimension (250), cube (50), contain (3100)
– then:
  data:      tf = 4/4  idf = log2(10000/1300) = 2.94  tf-idf = 2.94
  dimension: tf = 3/4  idf = log2(10000/250)  = 5.32  tf-idf = 3.99
  cube:      tf = 1/4  idf = log2(10000/50)   = 7.64  tf-idf = 1.91
  contain:   tf = 1/4  idf = log2(10000/3100) = 1.69  tf-idf = 0.42
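A minimal sketch reproducing these numbers, with the term counts and document frequencies taken from the example above:

    import math

    N = 10_000  # total number of documents in the collection
    term_freq = {"data": 4, "dimension": 3, "cube": 1, "contain": 1}
    doc_freq = {"data": 1300, "dimension": 250, "cube": 50, "contain": 3100}

    max_f = max(term_freq.values())  # frequency of the most frequent term
    for term, f in term_freq.items():
        tf = f / max_f
        idf = math.log2(N / doc_freq[term])
        print(f"{term}: tf = {f}/{max_f}, idf = {idf:.2f}, tf-idf = {tf * idf:.2f}")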
Text Similarity Measure
A text similarity measure is a function used to compute the degree of similarity between two vectors
– it is usually used to measure the similarity between the query and each of the documents in the database
  • it is used to rank the documents
    – the order indicates their relevance to the query
  • usually a threshold is used to control the number of retrieved relevant documents
Cosine Similarity Measure
Defined as the cosine of the angle between two vectors:
  similarity(dj, q) = (dj · q) / (|dj| × |q|) = Σi (wij × wiq) / (sqrt(Σi wij^2) × sqrt(Σi wiq^2))
where the sums run over the t terms
– example: D1 = 2T1 + 6T2 + 5T3, D2 = 5T1 + 5T2 + 2T3, Q = 0T1 + 0T2 + 2T3
  similarity(D1, Q) = (2·0 + 6·0 + 5·2) / sqrt((4 + 36 + 25) × (0 + 0 + 4)) = 10 / 16.1 = 0.62
  similarity(D2, Q) = (5·0 + 5·0 + 2·2) / sqrt((25 + 25 + 4) × (0 + 0 + 4)) = 4 / 14.7 = 0.27
– document D1 is more than twice as similar to the query as document D2
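The same computation as a minimal sketch over the three-term vectors of the example:

    import math

    def cosine_similarity(d, q):
        # cosine of the angle = dot product / (norm of d × norm of q)
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return dot / (norm_d * norm_q)

    D1, D2, Q = [2, 6, 5], [5, 5, 2], [0, 0, 2]
    print(round(cosine_similarity(D1, Q), 2))  # 0.62
    print(round(cosine_similarity(D2, Q), 2))  # 0.27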
If we have extra time, I can further talk about:
• Different levels of text mining (word -> sentence -> segment -> …)
• The application of text mining in other areas (spam and malicious executables detection)
Improving IR Systems
The basic design of the IR systems described so far can be enhanced to increase precision and recall, and to improve the term-document matrix
– the former issue is often addressed by using a relevance feedback mechanism, while the latter is addressed by using latent semantic indexing
Improving IR Systems
Latent Semantic Indexing
– improves effectiveness of the IR system by retrieving documents that are more relevant to a user's query, through manipulation of the term-document matrix
  • the original term-document matrix is often too large for the available computing resources, is presumed noisy (some anecdotal occurrences of terms should be eliminated), and is too sparse with respect to the “true” term-document matrix
– therefore, the original matrix is approximated by a smaller, “de-noisified” matrix
  • the new matrix is computed using the singular value decomposition (SVD) technique (Chapter 7)
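A minimal sketch of the rank-k approximation behind this idea, using numpy; the toy matrix and the choice k = 2 are illustrative assumptions only:

    import numpy as np

    # term-document matrix: rows = terms, columns = documents (toy weights)
    A = np.array([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 1.0, 0.0, 3.0],
                  [2.0, 0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2  # keep only the k largest singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k, "de-noisified" approximation
    print(np.round(A_k, 2))

Queries and documents are then compared in the reduced k-dimensional space instead of the original term space.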
Improving IR Systems
Relevance Feedback
– modification of the information search process to improve accuracy
– it adds terms to the initial query that may match relevant documents
1. perform IR search on the initial query
2. get feedback from the user as to which documents are relevant, and find new (with respect to the initial query) terms from known relevant documents
3. add the new terms to the initial query to form a new query
4. repeat the search using the new query
5. return a set of relevant documents based on the new query
6. user evaluates the returned documents
[Diagram: the user (1) sends the query to the IR system and (2) receives the matching documents; (3) new terms are extracted from the documents judged relevant and (4) merged into a new query; (5) the search is repeated, and (6) the user evaluates the newly returned documents]
Improving IR Systems
Relevance Feedback
– performed manually
  • the user manually identifies relevant documents, and new terms are selected either manually or automatically
– performed automatically (see the sketch below)
  • relevant documents are identified automatically by using the top-ranked documents, and new terms are selected automatically:
    1. identification of the N top-ranked documents
    2. identification of all terms from the N top-ranked documents
    3. selection of the feedback terms
    4. merging the feedback terms with the original query
    5. identifying the top-ranked documents for the modified query
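A minimal sketch of this automatic (pseudo) variant under simple assumptions: documents are bags of preprocessed terms, rank_documents is any scoring function (such as the cosine measure above) returning documents sorted by decreasing similarity, and feedback terms are picked by how many top documents contain them:

    from collections import Counter

    def pseudo_relevance_feedback(query_terms, documents, rank_documents,
                                  n_top=5, n_feedback_terms=3):
        # steps 1-2: take the N top-ranked documents and gather their terms
        top_docs = rank_documents(query_terms, documents)[:n_top]
        counts = Counter(term for doc in top_docs for term in set(doc))
        # step 3: select the most common terms not already in the query
        feedback = [t for t, _ in counts.most_common()
                    if t not in query_terms][:n_feedback_terms]
        # steps 4-5: merge them with the original query and re-run the search
        return rank_documents(list(query_terms) + feedback, documents)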
Improving IR Systems
Relevance Feedback
• Pros
  – it usually improves average precision by increasing the number of good terms in the query
• Cons
  – requires more computational work
  – it can decrease effectiveness
    • one bad new word can undo the good done by lots of good words
References
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley
Feldman, R. and Sanger, J. 2006. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press
Hotho, A., Nürnberger, A. and Paaß, G. 2005. A Brief Survey of Text Mining, GLDV-Journal for Computational Linguistics and Language Technology, 20(1):19-62
Salton, G. and McGill, M. 1983. Introduction to Modern Information Retrieval, McGraw-Hill