Text Operations J. H. Wang Feb. 21, 2008


Page 1:

Text Operations

J. H. Wang, Feb. 21, 2008

Page 2:

The Retrieval Process

[Figure: architecture of the retrieval process. The user, through the User Interface, expresses a user need; Text Operations produce the logical view of the text, and Query Operations turn the need into a query. The Indexing module builds an inverted file (Index) from the Text, Searching returns retrieved docs, and Ranking orders them as ranked docs, with user feedback refining the query. A DB Manager Module mediates access to the Text Database. Numbers in the original figure refer to book chapters.]

Page 3:

Outline

• Document Preprocessing (7.1-7.2)
• Text Compression (7.4-7.5): skipped
• Automatic Indexing (Chap. 9, Salton)
  – Term Selection

Page 4:

Document Preprocessing

• Lexical analysis
  – Letters, digits, punctuation marks, …
• Stopword removal
  – “the”, “of”, …
• Stemming
  – Prefix, suffix
• Index term selection
  – Noun
• Construction of term categorization structure
  – Thesaurus

Page 5:

• Logical view of the documents

[Figure: preprocessing pipeline. Docs → structure recognition → accents, spacing → stopwords → noun groups → stemming → automatic or manual indexing; the logical view moves from full text to a set of index terms.]

Page 6:

Lexical Analysis

• Converting a stream of characters into a stream of words
  – Recognition of words
  – Digits: usually not good index terms
    • Ex: the number of deaths due to car accidents between 1910 and 1989, “510B.C.”, credit card numbers, …
  – Hyphens
    • Ex: state-of-the-art, gilt-edge, B-49, …
  – Punctuation marks: normally removed entirely
    • Ex: 510B.C., program codes: x.id vs. xid, …
  – The case of letters: usually not important
    • Ex: Bank vs. bank, Unix-like operating systems, …

Page 7:

Elimination of Stopwords

• Stopwords: words which are too frequent among the documents in the collection and are therefore not good discriminators
  – Articles, prepositions, conjunctions, …
  – Some verbs, adverbs, and adjectives
• Removal reduces the size of the indexing structure
• Stopword removal might reduce recall
  – Ex: “to be or not to be”

Page 8:

Stemming

• The substitution of words by their respective stems
  – Ex: plurals, gerund forms, past tense suffixes, …
• A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
  – Ex: connect, connected, connecting, connection, connections
• Controversy about the benefits
  – Useful for improving retrieval performance
  – Reduces the size of the indexing structure

Page 9:

Stemming

• Four types of stemming strategies
  – Affix removal, table lookup, successor variety, and n-grams (or term clustering)
• Suffix removal
  – Porter’s algorithm (available in the Appendix)
    • Simplicity and elegance
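As a rough illustration of suffix removal (not Porter's full algorithm, which adds measure conditions and several rule steps), a toy sketch:

```python
# A minimal suffix-removal sketch in the spirit of Porter's algorithm;
# it implements only a few illustrative step-1 rules, not the full
# algorithm. Rules and ordering here are assumptions for demonstration.
SUFFIX_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ing", ""),     # connecting -> connect
    ("ed", ""),      # connected  -> connect
    ("s", ""),       # connections -> connection (but leave "ss" alone)
]

def stem(word: str) -> str:
    for suffix, replacement in SUFFIX_RULES:
        if suffix == "s" and word.endswith("ss"):
            continue  # do not strip the final "s" of a double-s ending
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ("connect", "connected", "connecting", "connections"):
    print(w, "->", stem(w))
```

Note that even this sketch conflates related forms (connected, connecting) onto one stem, which is exactly the index-size reduction the slide mentions.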

Page 10:

Index Term Selection

• Manually or automatically
• Identification of noun groups
  – Most of the semantics is carried by the noun words
  – Systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns
  – A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

Page 11:

Thesauri

• Thesaurus: a reference to a treasury of words
  – A precompiled list of important words in a given domain of knowledge
  – For each word in this list, a set of related words
    • Ex: synonyms, …
  – It also involves normalization of vocabulary, and a structure

Page 12:

Example Entry in Peter Roget’s Thesaurus

• Cowardly adjective
• Ignobly lacking in courage: cowardly turncoats.
• Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

Page 13:

Main Purposes of a Thesaurus

• To provide a standard vocabulary for indexing and searching

• To assist users with locating terms for proper query formulation

• To provide classified hierarchies that allow the broadening and narrowing of the current query request according to the user needs

Page 14:

Motivation for Building a Thesaurus

• Using a controlled vocabulary for the indexing and searching
  – Normalization of indexing concepts
  – Reduction of noise
  – Identification of indexing terms with a clear semantic meaning
  – Retrieval based on concepts rather than on words
    • Ex: term classification hierarchy in Yahoo!

Page 15:

Main Components of a Thesaurus

• Index terms: individual words, groups of words, phrases
  – Concept
    • Ex: “missiles, ballistic”
  – Definition or explanation
    • Ex: seal (marine animals) vs. seal (documents)
• Relationships among the terms
  – BT (broader), NT (narrower)
  – RT (related): much more difficult to establish
• A layout design for these term relationships
  – A list or bi-dimensional display

Page 16:

Automatic Indexing (Term Selection)

Page 17:

Automatic Indexing

• Indexing
  – Assign identifiers (index terms) to text documents
• Identifiers
  – Single-term vs. term phrase
  – Controlled vs. uncontrolled vocabularies
    • Ex: instruction manuals, terminological schedules, …
  – Objective vs. nonobjective text identifiers
    • Objective identifiers are controlled by cataloging rules, e.g., author names, publisher names, dates of publication, …

Page 18:

Two Issues

• Issue 1: indexing exhaustivity
  – Exhaustive: assign a large number of terms
  – Nonexhaustive: only main aspects of subject content
• Issue 2: term specificity
  – Broad terms (generic): cannot distinguish relevant from nonrelevant documents
  – Narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant

Page 19:

Recall vs. Precision

• Recall (R) = number of relevant documents retrieved / total number of relevant documents in the collection
  – The proportion of relevant items retrieved
• Precision (P) = number of relevant documents retrieved / total number of documents retrieved
  – The proportion of retrieved items that are relevant
• Example: for a query, e.g., Taipei

[Figure: Venn diagram of all docs, retrieved docs, and relevant docs; recall and precision are computed from their overlap.]
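The two definitions above can be sketched directly over sets of document IDs (the IDs and the query result below are made up for illustration):

```python
# Recall and precision from retrieved/relevant document-ID sets.
def recall_precision(retrieved: set, relevant: set) -> tuple:
    hits = len(retrieved & relevant)   # relevant documents retrieved
    recall = hits / len(relevant)      # fraction of relevant docs found
    precision = hits / len(retrieved)  # fraction of retrieved docs relevant
    return recall, precision

retrieved = {1, 2, 3, 4}        # documents returned for the query "Taipei"
relevant = {2, 4, 5, 6, 7, 8}   # all relevant documents in the collection
r, p = recall_precision(retrieved, relevant)
print(r, p)  # -> 0.3333..., 0.5
```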

Page 20:

More on Recall/Precision

• Simultaneously optimizing both recall and precision is not normally achievable
  – Narrow and specific terms: precision is favored
  – Broad and nonspecific terms: recall is favored
• When a choice must be made between term specificity and term breadth, the former is generally preferable
  – High-recall, low-precision results will burden the user
  – Lack of precision is more easily remedied than lack of recall

Page 21:

Term-Frequency Consideration

• Function words
  – For example, "and", "of", "or", "but", …
  – The frequencies of these words are high in all texts
• Content words
  – Words that actually relate to document content
  – Varying frequencies in the different texts of a collection
  – Indicate term importance for content

Page 22:

A Frequency-Based Indexing Method

• Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words
• Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di
• Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T
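The three steps above can be sketched as follows; the stop list and threshold T are toy values chosen for illustration:

```python
# Frequency-based index-term selection: stop list, tf counts, threshold.
from collections import Counter

STOP_LIST = {"and", "of", "or", "but", "the", "to", "a", "in"}
THRESHOLD = 1  # T: keep terms occurring more than once

def index_terms(text: str) -> set:
    # Step 1: lowercase, split, and drop function words via the stop list.
    words = [w for w in text.lower().split() if w not in STOP_LIST]
    # Step 2: term frequency tf_ij for every remaining term.
    tf = Counter(words)
    # Step 3: keep terms whose frequency exceeds the threshold T.
    return {term for term, freq in tf.items() if freq > THRESHOLD}

doc = "the cat sat and the cat ate the fish"
print(index_terms(doc))  # -> {'cat'}
```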

Page 23:

More on Term Frequency

• High-frequency terms favor recall
  – Ex: “Apple”
  – But only if their occurrence frequency is not equally high in other documents
• Low-frequency terms favor precision
  – Ex: “Huntington’s disease”
  – They distinguish the few documents in which they occur from the many from which they are absent

Page 24:

How to Compute the Weight wij?

• Inverse document frequency, idfj
  – wij = tfij × idfj (TF×IDF)
• Term discrimination value, dvj
  – wij = tfij × dvj
• Probabilistic term weighting, trj
  – wij = tfij × trj
• All three are global properties of terms in a document collection

Page 25:

Inverse Document Frequency

• Inverse document frequency (IDF) for term Tj:

  idfj = log(N / dfj)

  where dfj (document frequency of term Tj) is the number of documents in which Tj occurs, and N is the total number of documents
• Terms that fulfil both the recall and the precision requirements occur frequently in individual documents but rarely in the remainder of the collection

Page 26:

TF×IDF

• Weight wij of a term Tj in a document Di:

  wij = tfij × log(N / dfj)

• Eliminate common function words
• Compute the value of wij for each term Tj in each document Di
• Assign to the documents of a collection all terms with sufficiently high (tf × idf) weights
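A compact sketch of the weighting wij = tfij × log(N/dfj), over three made-up one-line documents:

```python
# tf x idf over a toy collection of three documents.
import math
from collections import Counter

docs = [
    "apple banana apple",
    "banana fruit",
    "apple fruit fruit",
]

N = len(docs)
tfs = [Counter(d.split()) for d in docs]         # tf_ij per document
df = Counter(term for tf in tfs for term in tf)  # df_j: docs containing T_j

def tfidf(i: int, term: str) -> float:
    return tfs[i][term] * math.log(N / df[term])

# "apple" occurs twice in doc 0 and in 2 of the 3 documents overall,
# so its weight there is 2 * log(3/2).
print(round(tfidf(0, "apple"), 4))  # -> 0.8109
```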

Page 27:

Term-Discrimination Value

• Useful index terms
  – Distinguish the documents of a collection from each other
• Document space
  – Each point represents a particular document of a collection
  – The distance between two points is inversely proportional to the similarity between the respective term assignments
• When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together

Page 28:

A Virtual Document Space

[Figure: three panels showing the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator.]

Page 29:

Good Term Assignment

• When a term is assigned to the documents of a collection, the few documents to which the term is assigned will be distinguished from the rest of the collection

• This should increase the average distance between the documents in the collection and hence produce a document space less dense than before

Page 30:

Poor Term Assignment

• A high frequency term is assigned that does not discriminate between the objects of a collection

• Its assignment will render the documents more similar

• This is reflected in an increase in document space density

Page 31:

Term Discrimination Value

• Definition:

  dvj = Q − Qj

  where Q and Qj are the space densities before and after the assignment of term Tj
• The space density is the average pairwise similarity between all pairs of distinct documents:

  Q = 1/(N(N−1)) × Σi=1..N Σk=1..N, k≠i sim(Di, Dk)

• dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term
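The definition above can be sketched with cosine similarity over tiny binary document vectors; the similarity measure and the data are illustrative assumptions:

```python
# Term-discrimination value dv_j = Q - Q_j over toy binary vectors.
import math
from itertools import permutations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def density(docs):
    # Q = 1/(N(N-1)) * sum over ordered pairs i != k of sim(D_i, D_k)
    n = len(docs)
    return sum(cosine(a, b) for a, b in permutations(docs, 2)) / (n * (n - 1))

def discrimination_value(docs, j):
    # Q (before assignment of T_j): density with term j removed;
    # Q_j (after assignment): density with term j present.
    without_j = [v[:j] + v[j + 1:] for v in docs]
    return density(without_j) - density(docs)

docs = [[1, 1, 0], [1, 0, 1], [1, 1, 0]]   # rows: docs; columns: terms T0..T2
print(discrimination_value(docs, 0) < 0)   # T0 in every doc: poor discriminator
print(discrimination_value(docs, 1) > 0)   # T1 separates docs: good discriminator
```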

Page 32:

Variations of Term-Discrimination Value with Document Frequency

[Figure: dvj plotted against document frequency, from 0 to N. Low frequency: dvj = 0; medium frequency: dvj > 0; high frequency: dvj < 0.]

Page 33:

tfij × dvj

• wij = tfij × dvj
• Compared with wij = tfij × log(N / dfj):
  – log(N / dfj): decreases steadily with increasing document frequency
  – dvj: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger
• Issue: efficiency problem of computing N(N−1) pairwise similarities

Page 34:

Document Centroid

• Document centroid C = (c1, c2, c3, ..., ct):

  cj = (1/N) Σi=1..N wij

  where wij is the weight of the j-th term in document Di
  – A “dummy” average document located in the center of the document space
• Space density:

  Q = (1/N) Σi=1..N sim(C, Di)
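The centroid gives a cheaper estimate of space density (N similarities to C instead of N(N−1) pairwise similarities); a sketch under the same illustrative assumptions (cosine similarity, made-up vectors):

```python
# Space density via the document centroid C.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    # c_j = (1/N) * sum_i w_ij, column-wise average of the weight vectors
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def space_density(docs):
    # Q = (1/N) * sum_i sim(C, D_i)
    c = centroid(docs)
    return sum(cosine(c, d) for d in docs) / len(docs)

docs = [[1, 1, 0], [1, 0, 1], [1, 1, 0]]
print(round(space_density(docs), 4))  # -> 0.8819
```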

Page 35:

Probabilistic Term Weighting

• Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection
• Definition: given a user query q and the ideal answer set of the relevant documents, decision theory gives the best ranking algorithm for a document D:

  g(D) = log [Pr(D|rel) / Pr(D|nonrel)] + log [Pr(rel) / Pr(nonrel)]

Page 36:

Probabilistic Term Weighting

• Pr(rel), Pr(nonrel): a document’s a priori probabilities of relevance and nonrelevance
• Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

Page 37:

Assumptions

• Terms occur independently in documents:

  Pr(D|rel) = Πi=1..t Pr(xi|rel)
  Pr(D|nonrel) = Πi=1..t Pr(xi|nonrel)

Page 38:

Derivation Process

  g(D) = log [Pr(D|rel) / Pr(D|nonrel)] + log [Pr(rel) / Pr(nonrel)]

       = log [Πi=1..t Pr(xi|rel) / Πi=1..t Pr(xi|nonrel)] + constants

       = Σi=1..t log [Pr(xi|rel) / Pr(xi|nonrel)] + constants

Page 39:

• Given a document D = (d1, d2, …, dt)
• Assume di is either 0 (absent) or 1 (present):

  Pr(xi = 1 | rel) = pi    Pr(xi = 0 | rel) = 1 − pi
  Pr(xi = 1 | nonrel) = qi    Pr(xi = 0 | nonrel) = 1 − qi

  so that

  Pr(xi = di | rel) = pi^di (1 − pi)^(1−di)
  Pr(xi = di | nonrel) = qi^di (1 − qi)^(1−di)

• For a specific document D:

  g(D) = Σi=1..t log [Pr(xi = di | rel) / Pr(xi = di | nonrel)] + constants

Page 40:

  g(D) = Σi=1..t log [Pr(xi = di | rel) / Pr(xi = di | nonrel)] + constants

       = Σi=1..t log [pi^di (1 − pi)^(1−di) / (qi^di (1 − qi)^(1−di))] + constants

       = Σi=1..t di log [pi (1 − qi) / (qi (1 − pi))] + Σi=1..t log [(1 − pi) / (1 − qi)] + constants

Page 41:

Term Relevance Weight

• Define the term relevance weight

  trj = log [pj (1 − qj) / (qj (1 − pj))]

  so that the ranking function becomes

  g(D) = Σi=1..t di · tri + Σi=1..t log [(1 − pi) / (1 − qi)] + constants

Page 42:

Issue

• How to compute pj and qj?

  pj = rj / R
  qj = (dfj − rj) / (N − R)

  – rj: the number of relevant documents that contain term Tj
  – R: the total number of relevant documents
  – N: the total number of documents
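With these estimates, trj can be computed directly from simple counts; a sketch with made-up numbers:

```python
# Term relevance weight tr_j from relevance-feedback counts.
import math

def term_relevance(r_j: int, R: int, df_j: int, N: int) -> float:
    p_j = r_j / R                  # Pr(term present | relevant)
    q_j = (df_j - r_j) / (N - R)   # Pr(term present | nonrelevant)
    return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))

# A term occurring in 8 of 10 relevant docs and 28 of 1000 docs overall:
print(round(term_relevance(r_j=8, R=10, df_j=28, N=1000), 4))  # -> 5.2679
```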

Page 43:

Estimation of Term-Relevance

• The occurrence probability of a term in the nonrelevant documents, qj, is approximated by its occurrence probability in the entire document collection:

  qj = dfj / N

  – The large majority of documents will be nonrelevant to the average query
• The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, by using a constant value pj = 0.5 for all j

Page 44:

Comparison

• Substituting pj = 0.5 and qj = dfj / N:

  trj = log [pj (1 − qj) / (qj (1 − pj))]
      = log [0.5 (1 − dfj/N) / ((dfj/N) · 0.5)]
      = log [(N − dfj) / dfj]

• When N is sufficiently large, N − dfj ≈ N, so

  trj = log [(N − dfj) / dfj] ≈ log (N / dfj) = idfj
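A quick numeric check of this approximation (the values of N and dfj below are arbitrary):

```python
# For large N, log((N - df)/df) is close to log(N/df) = idf.
import math

N, df = 1_000_000, 50
tr = math.log((N - df) / df)
idf = math.log(N / df)
print(round(tr, 4), round(idf, 4))  # -> 9.9034 9.9035
```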