© 2007 Cios / Pedrycz / Swiniarski / Kurgan
Chapter 14: TEXT MINING
Presented by: Yulong Zhang, Nov. 16, 2011
Outline
• Introduction
• Information Retrieval
  – Definition
  – Architecture of IR Systems
  – Linguistic Preprocessing
  – Measures of Text Retrieval
  – Vector-Space Model
  – Text Similarity Measures
  – Misc
• Improving Information Retrieval Systems
  – Latent Semantic Indexing
  – Relevance Feedback
What we have learned – structured data
Feature extraction, feature selection, discretization, clustering, classification, …
But what should we do with data like these?
Introduction
So far we have focused on data mining methods for analysis and extraction of useful knowledge from structured data, such as flat files and relational or transactional databases.
In contrast, text mining is concerned with analysis of text databases that consist of mainly semi-structured or unstructured data, such as:
– collections of articles, research papers, e-mails, blogs, discussion forums, and WWW pages
Introduction
Semi-structured data is neither completely structured (e.g., a relational table) nor completely unstructured (e.g., free text)
– a semi-structured document has:
  • some structured fields, such as title, list of authors, keywords, publication date, category, etc.
  • and some unstructured fields, such as abstract and contents
Important differences between the two:
• The number of features: structured data typically has on the order of 10^0 to 10^2 features, while semi-structured data has on the order of 10^3
• Features of semi-structured data are sparse
• Semi-structured data is growing quickly in size (only a tiny portion of it is relevant, and even less is useful)
Introduction
There are three main types of retrieval within the knowledge discovery process framework:
– data retrieval: retrieval of structured data from DBMS and data warehouses (Chapter 6), e.g., SELECT * FROM xx WHERE yy = zz
– information retrieval: organization and retrieval of information from large collections of semi-structured or unstructured text-based databases and the web
– knowledge retrieval: generation of knowledge from (usually) structured data (Chapters 9-13), e.g., IF xx THEN yy ELSE zz
Information Retrieval
...[the] actions, methods and procedures for recovering stored data to provide information on a given subject
ISO 2382/1 (1984)
Information Retrieval: Definitions
– database: a collection of documents
– document: a sequence of terms in a natural language that expresses ideas about some topic
– term: a semantic unit; a word, phrase, or root of a word
– query: a request for documents that cover a particular topic of interest to the user of an IR system
[Diagram: a term (e.g., “Obama”) occurs in a document (e.g., one about the next president), which is stored in a database]
Information Retrieval: Definitions
– IR system
  • its goal is to find relevant documents in response to the user's request
  • it performs matching between the language of the query and the language of the document
    – simple word matching does not work, since the same word can have many different semantic meanings, e.g., MAKE:
      “to make a mistake”
      “make of a car”
      “to make up excuses”
      “to make for an exit”
      “it is just a make-believe”
    – also, one word may have many morphological variants: make, makes, made, making
Information Retrieval: Definitions
• other problems
  – consider the query “Abraham Lincoln”: should it return a document that contains the sentences “Abraham owns a Lincoln. It is a great car.”?
  – consider the query “what is the funniest movie ever made”: how can the IR system know what the user's idea of a funny movie is?
• difficulties include inherent properties of natural language, high expectations of the user, etc.
  – a number of mechanisms were developed to cope with them
“Wait! Is data mining an interesting course? No! No wait! Data mining is an interesting course!”
Information Retrieval
An IR system cannot provide one “best” answer to the user
– many algorithms provide one “correct” answer, such as SELECT Price FROM Sales WHERE Item = “book”, or: find the shortest path from node A to node B
– IR, on the other hand, provides a range of possibly best answers and lets the user choose
  • the query “Lincoln” may return information about:
    – Abraham Lincoln
    – a Lincoln dealership
    – the Lincoln Memorial
    – the University of Nebraska-Lincoln
    – the Lincoln University in New Zealand
IR systems do not give just one right answer but perform an approximate search that returns multiple, potentially correct answers.
Information Retrieval
An IR system provides information based on the stored data
– the key is to provide some measurement of relevance between the stored data and the user’s query
• i.e., the relation between requested information and retrieved information
• given a query, the IR system has to check whether the stored information is relevant to the query
Information Retrieval
An IR system provides information based on the stored data
– IR systems use heuristics to find relevant information
  • they find a “close” answer and use heuristics to measure its closeness to the “right” answer
  • the inherent difficulty is that very often we do not know what the right answer is!
    – we just measure how close to the right answer we can come
– the solution involves using measures of precision and recall, which are used to measure the “accuracy” of IR systems (discussed later)
Architecture of IR Systems
[Diagram: architecture of an IR system]
• source documents (e.g., D1: “Lincoln Park Zoo is everyone's zoo …”, D2: “This website includes a biography, photographs, and lots …”, D3: “Biography of Abraham Lincoln, the sixteenth President …”) are gathered as textual data
• the textual data is converted into tagged data, e.g., D1: <DOC> <DOCNO>1</DOCNO> <TEXT>Lincoln Park Zoo is every …</TEXT> </DOC>
• the tagged data is indexed into an inverted index that forms the search database, e.g., lincoln: D1, D2, D13, D54, …; zoo: D2, D43, D198, …; website: D1, D2, D3, D4, …; university: D4, D8, D14, …
• the user's query, “where is the University of Nebraska Lincoln?”, is preprocessed into the transformed query: where, university, nebraska, lincoln
• a similarity measure compares the transformed query with each indexed document, e.g., similarity(D1, query) = 0.15, similarity(D2, query) = 0.10, similarity(D3, query) = 0.14, similarity(D4, query) = 0.75, …
• the system returns a list of ranked documents: D4, D52, D12, D134, …
Architecture of IR Systems
– Search database
  • organized as an inverted index file of the significant character strings that occur in the stored tagged data
    – the inverted file specifies where the strings occur in the text
    – the strings include words after excluding determiners, conjunctions, and prepositions, known as STOP-WORDS
      » a determiner is a non-lexical element preceding a noun in a noun phrase, e.g., the, that, two, a, many, all
      » a conjunction is used to combine multiple sentences, e.g., and, or
      » a preposition links nouns, pronouns, and phrases to other words in a sentence, e.g., on, beneath, over, of, during, beside
    – the stored words use a common form
      » they are stemmed to obtain the ROOT FORM by removing common prefixes and suffixes
      » synonyms of a given word are also found and used, e.g.: disease, diseased, diseases, illness, unwellness, malady, sickness, …
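As an illustration of the indexing step, here is a minimal sketch of inverted-index construction in Python; the stop-word list and documents are hypothetical, and a real system would also stem each word:

    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "is", "to", "and", "or", "in"}

    documents = {
        "D1": "Lincoln Park Zoo is everyone's zoo",
        "D3": "Biography of Abraham Lincoln, the sixteenth President",
    }

    def tokenize(text):
        # lowercase, split on non-alphanumeric characters, drop stop words
        words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
        return [w for w in words if w not in STOP_WORDS]

    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            inverted_index[term].add(doc_id)

    print(sorted(inverted_index["lincoln"]))  # ['D1', 'D3']

A production index would also record the positions at which each term occurs, to support the positional query operators mentioned on the next slide.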
Architecture of IR Systems
– Query
  • composed of character strings combined by Boolean operators and additional features, such as contextual or positional operators
    – the query is also preprocessed by removing determiners, conjunctions, and prepositions
– No linguistic analysis of the semantics of the stored texts, or of the queries, is performed
  • thus IR systems are domain-independent
How does it fit in the CS taxonomy?
[Diagram: Computers → Artificial Intelligence, Algorithms, Databases, Networking; Artificial Intelligence → Robotics, Search, Natural Language Processing; Natural Language Processing → Information Retrieval, Machine Translation, Language Analysis; Language Analysis → Semantics, Parsing]
By Rada Mihalcea, “Natural Language Processing”
Linguistic Preprocessing
Creation of the inverted index requires linguistic preprocessing, which aims at extracting important terms from a document represented as a bag of words.
Term extraction involves two main operations:
– Removal of stop words
– Stemming
Linguistic Preprocessing
• Removal of stop words
  – stop words are defined as terms that are irrelevant, although they occur frequently in the documents:
    • a determiner is a non-lexical element preceding a noun in a noun phrase; determiners include articles (a, an, the), demonstratives when used with noun phrases (this, that, these, those), possessive determiners (her, his, its, my, our, their, your), and quantifiers (all, few, many, several, some, every)
    • a conjunction is a part of speech used to combine two words, phrases, or clauses; conjunctions include coordinating conjunctions (for, and, nor, but, or, yet, so), correlative conjunctions (both … and, either … or, not (only) … but (… also)), and subordinating conjunctions (after, although, if, unless, because)
Linguistic Preprocessing
• Removal of stop words
  – stop words are defined as terms that are irrelevant, although they may occur frequently in the documents:
    • a preposition links nouns, pronouns, and phrases to other words in a sentence (on, beneath, over, of, during, beside, etc.)
  • finally, the stop words include some custom-defined words related to the subject of the database
    – e.g., for a database that lists all research papers related to brain modeling, the words brain and model should be removed
Linguistic Preprocessing
• Stemming
  – words that appear in documents often have many morphological variants
  – each word that is not a stop word is reduced to its corresponding stem word (term)
    • words are stemmed to obtain the root form by removing common prefixes and suffixes
    • in this way, we can identify groups of corresponding words where the words in the group are syntactical variants of each other, and collect only one word per group
      – for instance, the words disease, diseases, and diseased share the common stem disease, and can be treated as different occurrences of this term
Linguistic Preprocessing
• For English, stemming is not a big problem: publicly available algorithms give good results. The most widely used is the Porter stemmer, at http://www.tartarus.org/~martin/PorterStemmer/
• Other languages can be harder; e.g., in the Slovenian language, 10-20 different forms correspond to the same word (“to laugh” in Slovenian): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete, smejeva, smejes, smejemo, smejis, smeje, smejoc, smejta, smejte, smejva
• In Chinese… it all goes without saying (there are no inflectional variants to stem, and words are not even separated by spaces)
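As a quick illustration, a minimal sketch using the Porter stemmer implementation shipped with NLTK (assuming the nltk package is installed; any implementation of Porter's algorithm behaves similarly):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    # morphological variants collapse to a common stem (term)
    for word in ["disease", "diseases", "diseased", "make", "makes", "making"]:
        print(word, "->", stemmer.stem(word))
    # the three disease variants map to one stem; the make variants map to "make"

Note that the stem need not itself be a dictionary word; it only has to be a common form shared by the variants.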
Measures of Text Retrieval
Let us suppose that an IR system returned a set of documents in response to the user's query.
We define measures that allow us to evaluate how accurate (correct) the system's answer was.
Two types of documents can be found in a database:
– relevant documents, which are relevant to the user's query
– retrieved documents, which are returned to the user by the system
Precision and Recall
• Precision
  – evaluates the ability to retrieve documents that are mostly relevant:
    precision = |X| / total number of retrieved documents
• Recall
  – evaluates the ability of the search to find all of the relevant items in the corpus:
    recall = |X| / total number of relevant documents
where X = the set of documents that are both retrieved and relevant
[Diagram: within the entire collection of documents, the retrieved set and the relevant set overlap; X is their intersection, and the remaining regions are irrelevant retrieved, irrelevant not retrieved, and relevant not retrieved]
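With hypothetical document IDs, the two measures reduce to set operations; a minimal sketch:

    def precision_recall(retrieved, relevant):
        # X = documents that are both retrieved and relevant
        x = len(set(retrieved) & set(relevant))
        return x / len(retrieved), x / len(relevant)

    retrieved = ["D4", "D52", "D12", "D134"]
    relevant = ["D4", "D12", "D7"]
    print(precision_recall(retrieved, relevant))  # precision = 0.5, recall ≈ 0.667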
Precision and Recall
• Trade-off between precision and recall
[Plot: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1); the ideal case is the upper-right corner. A system that returns relevant documents but misses many of them sits at high precision and low recall; one that returns most relevant documents but includes lots of unwanted documents sits at high recall and low precision]
Computing Recall
• The number of relevant documents is often not available, so we use techniques to estimate it, such as:
  – sampling across the database and performing relevance judgments on the documents
  – applying different retrieval algorithms to the same database for the same query
    • the relevant documents are the aggregate of all found documents
– the generated list is a gold standard against which recall is computed
Computing Recall and Precision
• For a given query:
  – generate the ranked list of retrieved documents
  – adjust a threshold on the ranked list to generate different sets of retrieved documents, and thus different recall/precision measures
    • mark each document in the ranked list that is relevant according to the gold standard
    • compute recall and precision at each position in the ranked list that contains a relevant document

  rank  doc #  relevant?  precision / recall
  1     134    yes        P = 1/1 = 1.00, R = 1/7 = 0.14
  2     1987   yes        P = 2/2 = 1.00, R = 2/7 = 0.29
  3     21
  4     8712   yes        P = 3/4 = 0.75, R = 3/7 = 0.43
  5     112
  6     567    yes        P = 4/6 = 0.67, R = 4/7 = 0.57
  7     810
  8     12     yes        P = 5/8 = 0.63, R = 5/7 = 0.71
  9     346
  10    478
  11    7834   yes        P = 6/11 = 0.55, R = 6/7 = 0.86
  12    3412

(total # of relevant docs = 7; the list is still missing one relevant document, and thus will never reach 100% recall)
[Plot: the resulting precision (y-axis) vs. recall (x-axis) points]
Computing Recall/Precision Points: Example 1

  n   doc #  relevant
  1   588    x
  2   589    x
  3   576
  4   590    x
  5   986
  6   592    x
  7   984
  8   988
  9   578
  10  985
  11  103
  12  591
  13  772    x
  14  990

Let total # of relevant docs = 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/2 = 1
  R = 3/6 = 0.5;   P = 3/4 = 0.75
  R = 4/6 = 0.667; P = 4/6 = 0.667
  R = 5/6 = 0.833; P = 5/13 = 0.38
Missing one relevant document; will never reach 100% recall.
(This and the following evaluation slides are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).)
Computing Recall/Precision Points: Example 2

  n   doc #  relevant
  1   588    x
  2   576
  3   589    x
  4   342
  5   590    x
  6   717
  7   984
  8   772    x
  9   321    x
  10  498
  11  113
  12  628
  13  772
  14  592    x

Let total # of relevant docs = 6. Check each new recall point:
  R = 1/6 = 0.167; P = 1/1 = 1
  R = 2/6 = 0.333; P = 2/3 = 0.667
  R = 3/6 = 0.5;   P = 3/5 = 0.6
  R = 4/6 = 0.667; P = 4/8 = 0.5
  R = 5/6 = 0.833; P = 5/9 = 0.556
  R = 6/6 = 1.0;   P = 6/14 = 0.429
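These points are mechanical to compute; a minimal sketch that reproduces Example 1, where the relevance flags come straight from the table above:

    def recall_precision_points(relevance_flags, total_relevant):
        # walk down the ranked list; at each relevant document, record recall
        # and precision for the list truncated at that rank
        points, found = [], 0
        for rank, is_relevant in enumerate(relevance_flags, start=1):
            if is_relevant:
                found += 1
                points.append((found / total_relevant, found / rank))
        return points

    # Example 1: relevant documents at ranks 1, 2, 4, 6, and 13
    flags = [True, True, False, True, False, True, False,
             False, False, False, False, False, True, False]
    for recall, precision in recall_precision_points(flags, total_relevant=6):
        print(f"R = {recall:.3f}; P = {precision:.3f}")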
Compare Two or More Systems
• Plotting the recall/precision points of each system, the curve closest to the upper right-hand corner of the graph indicates the best performance
R-Precision
• Precision at the R-th position in the ranking of results for a query that has R relevant documents.
• For the ranked list of Example 1 (relevant documents at ranks 1, 2, 4, 6, and 13):
  R = # of relevant docs = 6
  R-Precision = 4/6 = 0.67
F-Measure
• One measure of performance that takes both recall and precision into account.
• Harmonic mean of recall and precision:
  F = 2PR / (P + R) = 2 / (1/R + 1/P)
• Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
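As a quick check with rank 4 of Example 1 (P = 0.75, R = 0.5): F = 2 × 0.75 × 0.5 / (0.75 + 0.5) = 0.6, slightly below the arithmetic mean of 0.625. For a lopsided pair such as P = 1, R = 0.1, the harmonic mean drops to F ≈ 0.18 while the arithmetic mean stays at 0.55, which is why F rewards systems that balance the two.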
E Measure (parameterized F Measure)
• A variant of the F measure that allows weighting emphasis on precision over recall:
  E = ((β^2 + 1) P R) / (β^2 P + R)
• The value of β controls the trade-off:
  – β = 1: equally weight precision and recall (E = F)
  – β > 1: weight recall more
  – β < 1: weight precision more
Mean Average Precision (MAP)
• Average Precision: the average of the precision values at the points at which each relevant document is retrieved, with relevant documents that are never retrieved contributing a precision of 0.
  – Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
  – Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: the average of the average precision values for a set of queries.
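A minimal sketch of average precision, reusing recall_precision_points and flags from the sketch above; the 0 contributed by the never-retrieved relevant document falls out of dividing by total_relevant:

    def average_precision(relevance_flags, total_relevant):
        points = recall_precision_points(relevance_flags, total_relevant)
        return sum(precision for _, precision in points) / total_relevant

    print(round(average_precision(flags, 6), 3))
    # 0.634 for Example 1 (the slide's 0.633 comes from rounding each precision first)

Mean average precision is then just the mean of this value over a set of queries.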
Non-Binary Relevance
• Documents are rarely entirely relevant or non-relevant to a query
• There are many sources of graded relevance judgments:
  – relevance judgments on a 5-point scale
  – multiple judges
  – click distribution and deviation from expected levels (but click-through != relevance judgments)
Cumulative Gain
• With graded relevance judgments, we can compute the gain at each rank.
• Cumulative Gain at rank n:
  CG_n = rel_1 + rel_2 + … + rel_n
  (where rel_i is the graded relevance of the document at position i)
How to Measure Text Similarity?
It is a well-studied problem
– the metrics use a “bag of words” model
  • it completely ignores word order and syntactic structure
  • it treats both document and query as a bag of independent words
    – common “stop words” are removed
    – words are stemmed to reduce them to their root form
  • the preprocessed words are called terms
– the vector-space model is used to calculate a similarity measure between documents and a query, and between two documents
The Vector-Space Model
Assumptions:
– Vocabulary: the set of all distinct terms that remain after preprocessing the documents in the database; it contains t index terms
  • these “orthogonal” terms form a vector space
– each term i in a document or query j is given a real-valued weight wij
  • documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)
3D Example of the Vector-Space Model
[Plot: D1, D2, and Q drawn as vectors in the three-dimensional term space spanned by T1, T2, and T3]
Example
– document D1 = 2T1 + 6T2 + 5T3
– document D2 = 5T1 + 5T2 + 2T3
– query Q = 0T1 + 0T2 + 2T3
– which document is closer to the query?
  • how to measure it?
    – Distance?
    – Angle?
    – Projection?
The Vector-Space Model
• A collection of n documents
  – is represented in the vector-space model by a term-document matrix
  – a cell in the matrix corresponds to the weight of a term in the document
    • a value of zero means that the term does not exist in the document
• Next we explain how the weights are computed

        T1   T2   …   Tt
  D1   w11  w21  …  wt1
  D2   w12  w22  …  wt2
   :    :    :        :
  Dn   w1n  w2n  …  wtn
Term Weights
Frequency of a term
– more frequent terms in a document are more important
  • they are more indicative of the topic of the document
  fij = frequency of term i in document j
– the frequency is normalized by dividing by the frequency of the most frequent term in the document:
  tfij = fij / maxi(fij)
Term Weights
Inverse document frequency
– used to indicate the term's discriminative power
  • terms that appear in many different documents are less indicative of a specific topic
  dfi = document frequency of term i = # of documents containing term i
  idfi = inverse document frequency of term i = log2(N / dfi)
  where N is the total # of documents in the database; log2 is used to dampen the effect relative to tfij
TF-IDF Weighting
Term frequency-inverse document frequency (tf-idf) weighting:
  wij = tfij × idfi = tfij × log2(N / dfi)
– the highest weight is assigned to terms that occur frequently in the document but rarely in the rest of the database
  • some other ways of determining term weights have also been proposed
– the tf-idf weighting was found to work very well through extensive experimentation, and thus it is widely used
Example of TF-IDF Weighting
Consider the following document:
“data cube contains x data dimension, y data dimension, and z data dimension”
where grey color stands for “ignored” letters, and the other colors indicate distinct terms
– the frequencies are: data (4), dimension (3), cube (1), contain (1)
– we assume that the entire collection contains 10,000 documents, and that the document frequencies of these four terms are: data (1300), dimension (250), cube (50), contain (3100)
– then:
  data:      tf = 4/4  idf = log2(10000/1300) = 2.94  tf-idf = 2.94
  dimension: tf = 3/4  idf = log2(10000/250)  = 5.32  tf-idf = 3.99
  cube:      tf = 1/4  idf = log2(10000/50)   = 7.64  tf-idf = 1.91
  contain:   tf = 1/4  idf = log2(10000/3100) = 1.69  tf-idf = 0.42
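A minimal sketch reproducing these numbers, with the term counts and document frequencies taken from the example above:

    import math

    N = 10_000  # total number of documents in the collection
    term_freq = {"data": 4, "dimension": 3, "cube": 1, "contain": 1}
    doc_freq = {"data": 1300, "dimension": 250, "cube": 50, "contain": 3100}

    max_f = max(term_freq.values())  # frequency of the most frequent term
    for term, f in term_freq.items():
        tf = f / max_f
        idf = math.log2(N / doc_freq[term])
        print(f"{term}: tf = {f}/{max_f}, idf = {idf:.2f}, tf-idf = {tf * idf:.2f}")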
Text Similarity Measure
A text similarity measure is a function used to compute the degree of similarity between two vectors
– it is usually used to measure the similarity between the query and each of the documents in the database
  • it is used to rank the documents
    – the order indicates their relevance to the query
  • usually a threshold is used to control the number of retrieved relevant documents
Cosine Similarity Measure
Defined as the cosine of the angle between two vectors:
  similarity(dj, q) = (dj · q) / (|dj| × |q|) = Σi (wij × wiq) / (sqrt(Σi wij^2) × sqrt(Σi wiq^2))
where the sums run over the t terms
– example: D1 = 2T1 + 6T2 + 5T3, D2 = 5T1 + 5T2 + 2T3, Q = 0T1 + 0T2 + 2T3
  similarity(D1, Q) = (2·0 + 6·0 + 5·2) / sqrt((4 + 36 + 25) × (0 + 0 + 4)) = 10 / 16.1 = 0.62
  similarity(D2, Q) = (5·0 + 5·0 + 2·2) / sqrt((25 + 25 + 4) × (0 + 0 + 4)) = 4 / 14.7 = 0.27
– document D1 is more than twice as similar to the query as document D2
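The same computation as a minimal sketch over the three-term vectors of the example:

    import math

    def cosine_similarity(d, q):
        # cosine of the angle = dot product / (norm of d × norm of q)
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return dot / (norm_d * norm_q)

    D1, D2, Q = [2, 6, 5], [5, 5, 2], [0, 0, 2]
    print(round(cosine_similarity(D1, Q), 2))  # 0.62
    print(round(cosine_similarity(D2, Q), 2))  # 0.27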
If we have extra time, I can further talk about:
• Different levels of text mining (word -> sentence -> segment -> …)
• The application of text mining in other areas (spam and malicious executables detection)
Improving IR Systems
The basic design of the IR systems described so far can be enhanced to increase precision and recall, and to improve the term-document matrix
– the former issue is often addressed by using a relevance feedback mechanism, while the latter is addressed by using latent semantic indexing
Improving IR Systems
Latent Semantic Indexing
– improves effectiveness of the IR system by retrieving documents that are more relevant to a user's query, through manipulation of the term-document matrix
  • the original term-document matrix is often too large for the available computing resources, is presumed noisy (some anecdotal occurrences of terms should be eliminated), and is too sparse with respect to the “true” term-document matrix
– therefore, the original matrix is approximated by a smaller, “de-noisified” matrix
  • the new matrix is computed using the singular value decomposition (SVD) technique (Chapter 7)
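A minimal sketch of the rank-k approximation behind this idea, using numpy; the toy matrix and the choice k = 2 are illustrative assumptions only:

    import numpy as np

    # term-document matrix: rows = terms, columns = documents (toy weights)
    A = np.array([[1.0, 0.0, 2.0, 0.0],
                  [0.0, 1.0, 0.0, 3.0],
                  [2.0, 0.0, 1.0, 0.0],
                  [0.0, 2.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2  # keep only the k largest singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k, "de-noisified" approximation
    print(np.round(A_k, 2))

Queries and documents are then compared in the reduced k-dimensional space instead of the original term space.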
Improving IR Systems
Relevance Feedback
– modification of the information search process to improve accuracy
– it adds terms to the initial query that may match relevant documents
1. perform IR search on the initial query
2. get feedback from the user as to which documents are relevant, and find new (with respect to the initial query) terms from known relevant documents
3. add the new terms to the initial query to form a new query
4. repeat the search using the new query
5. return a set of relevant documents based on the new query
6. user evaluates the returned documents
[Diagram: the user (1) sends the query to the IR system and (2) receives the matching documents; (3) new terms are extracted from the documents judged relevant and (4) merged into a new query; (5) the search is repeated, and (6) the user evaluates the newly returned documents]
Improving IR Systems
Relevance Feedback
– performed manually
  • the user manually identifies relevant documents, and new terms are selected either manually or automatically
– performed automatically (see the sketch below)
  • relevant documents are identified automatically by using the top-ranked documents, and new terms are selected automatically:
    1. identification of the N top-ranked documents
    2. identification of all terms from the N top-ranked documents
    3. selection of the feedback terms
    4. merging the feedback terms with the original query
    5. identifying the top-ranked documents for the modified query
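A minimal sketch of this automatic (pseudo) variant under simple assumptions: documents are bags of preprocessed terms, rank_documents is any scoring function (such as the cosine measure above) returning documents sorted by decreasing similarity, and feedback terms are picked by how many top documents contain them:

    from collections import Counter

    def pseudo_relevance_feedback(query_terms, documents, rank_documents,
                                  n_top=5, n_feedback_terms=3):
        # steps 1-2: take the N top-ranked documents and gather their terms
        top_docs = rank_documents(query_terms, documents)[:n_top]
        counts = Counter(term for doc in top_docs for term in set(doc))
        # step 3: select the most common terms not already in the query
        feedback = [t for t, _ in counts.most_common()
                    if t not in query_terms][:n_feedback_terms]
        # steps 4-5: merge them with the original query and re-run the search
        return rank_documents(list(query_terms) + feedback, documents)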
Improving IR Systems
Relevance Feedback
• Pros
  – it usually improves average precision by increasing the number of good terms in the query
• Cons
  – requires more computational work
  – it can decrease effectiveness
    • one bad new word can undo the good done by lots of good words
References
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley
Feldman, R. and Sanger, J. 2006. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press
Hotho, A., Nürnberger, A. and Paaß, G. 2005. A Brief Survey of Text Mining, GLDV-Journal for Computational Linguistics and Language Technology, 20(1):19-62
Salton, G. and McGill, M. 1983. Introduction to Modern Information Retrieval, McGraw-Hill