Natural Language Processing
Topics in Information Retrieval
August, 2002
Background on IR
- Retrieve textual information from document repositories.
- The user enters a query describing the desired information.
- The system returns a list of documents: an exact match or a ranked list.
Text Categorization
- Attempt to assign documents to two or more pre-defined categories.
- Routing: ranking of documents according to relevance, where training information in the form of relevance labels is available.
- Filtering: an absolute (yes/no) assessment of relevance.
Design Features of IR Systems
- Inverted index: the primary data structure of IR systems. For each word it lists the documents the word occurs in and its frequency in each. Including position information also allows searching for phrases.
- Stop list (function words): lists words unlikely to be useful for searching, e.g. the, from, to. Excluding these words reduces the size of the inverted index.
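A minimal sketch of such a positional inverted index in Python, assuming whitespace tokenization and a tiny illustrative stop list (the design on the slide is more general):

```python
from collections import defaultdict

STOP_WORDS = {"the", "from", "to", "a", "that"}  # illustrative stop list

def build_index(docs):
    """Map each term to {doc_id: [positions]}; positions enable phrase search."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            if token not in STOP_WORDS:  # excluding stop words shrinks the index
                index[token][doc_id].append(pos)
    return index

docs = ["The man said that a space age man appeared",
        "Those men appeared to say their age"]
index = build_index(docs)
print(dict(index["age"]))  # {0: [6], 1: [6]} -> doc ids and token positions
```

Phrase search then amounts to checking that two terms occur in the same document at adjacent positions.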
Design Features (Cont.)
- Stemming: a simplified form of morphological analysis consisting simply of truncating a word. For example, laughing, laughs, laugh, and laughed are all stemmed to laugh.
- The problem: semantically different words like gallery and gall may both be truncated to gall, making the stems unintelligible to users.
- Well-known stemmers: the Lovins and Porter stemmers.
- Thesaurus: widen the search to include documents that use related terms.
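A toy suffix-stripping stemmer in the truncation spirit described above; the Lovins and Porter algorithms are rule-based and far more careful, so this is only an illustration:

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative suffixes, not Porter's actual rules

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["laughing", "laughs", "laugh", "laughed"]])
# ['laugh', 'laugh', 'laugh', 'laugh']
```

A more aggressive truncation rule (e.g., cutting to a fixed prefix length) is what conflates pairs like gallery and gall.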
Evaluation Measures
- Precision: percentage of returned items that are relevant.
- Recall: percentage of all relevant documents in the collection that are in the returned set.
- Ways to combine precision and recall: evaluation at a fixed cutoff, uninterpolated average precision, interpolated average precision, precision-recall curves, and the F measure.
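A small sketch of computing these measures for a single returned set, assuming relevance judgments are given as sets of document ids; the F measure here is the weighted harmonic mean, with alpha = 0.5 giving the balanced F1:

```python
def evaluate(returned, relevant, alpha=0.5):
    """Precision, recall, and the F measure for one returned set."""
    tp = len(returned & relevant)                       # relevant items returned
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    if precision == 0.0 or recall == 0.0:
        f = 0.0
    else:
        f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f

# 2 of the 4 returned documents are relevant; 2 of the 3 relevant are found:
print(evaluate({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666..., 0.571...)
```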
Probability Ranking Principle (PRP)
- Ranking documents in order of decreasing probability of relevance is optimal.
- Views retrieval as a greedy search that aims to identify the most valuable document at each step.
- Assumptions of the PRP:
  - Documents are independent.
  - A complex information need can be broken into a number of queries, each optimized in isolation.
  - The probability of relevance is only estimated.
The Vector Space Model
- Measures closeness between a query and a document.
- Queries and documents are represented as n-dimensional vectors; each dimension corresponds to a word.
- Advantages: conceptual simplicity and the use of spatial proximity to model semantic proximity.
Vector Similarity
- d  = "The man said that a space age man appeared"
- d' = "Those men appeared to say their age"
Vector Similarity (Cont.)
- Cosine measure (normalized correlation coefficient):

$$\cos(\vec{q}, \vec{d}) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}$$

- Euclidean distance:

$$|\vec{q} - \vec{d}\,| = \sqrt{\sum_{i=1}^{n} (q_i - d_i)^2}$$
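A sketch applying both measures to the example documents d and d' above, using raw lowercased token counts as vector components (no stop list or stemming, so the numbers are purely illustrative):

```python
import math
from collections import Counter

d1 = Counter("the man said that a space age man appeared".split())
d2 = Counter("those men appeared to say their age".split())
vocab = sorted(set(d1) | set(d2))        # one dimension per word
x = [d1[w] for w in vocab]
y = [d2[w] for w in vocab]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

dot = sum(a * b for a, b in zip(x, y))   # only 'appeared' and 'age' overlap
print("cosine:   ", dot / (norm(x) * norm(y)))                        # ~0.23
print("euclidean:", math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))))
```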
Term Weighting
Quantities used:
- tf_{i,j} (term frequency): number of occurrences of word w_i in document d_j
- df_i (document frequency): number of documents that w_i occurs in
- cf_i (collection frequency): total number of occurrences of w_i in the collection
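These quantities are simple to compute; a sketch over a hypothetical three-document toy corpus:

```python
from collections import Counter

corpus = ["cosmonaut moon car", "astronaut moon", "car truck car"]
docs = [doc.lower().split() for doc in corpus]

tf = [Counter(tokens) for tokens in docs]                # tf[j][w] = tf_{w,j}
df = Counter(w for tokens in docs for w in set(tokens))  # documents containing w
cf = Counter(w for tokens in docs for w in tokens)       # total occurrences of w
print(tf[2]["car"], df["car"], cf["car"])                # 2 2 3
```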
Term Weighting (Cont.)
- Sublinear term frequency scaling: use $1 + \log(\mathrm{tf}_{i,j})$ for $\mathrm{tf}_{i,j} > 0$.
- df_i is an indicator of a word's informativeness, motivating inverse document frequency (IDF) weighting: $\log\frac{N}{\mathrm{df}_i}$.
- TF.IDF (term frequency × inverse document frequency) is an indicator of semantically focused words:

$$\mathrm{weight}(i,j) = \begin{cases} \left(1 + \log(\mathrm{tf}_{i,j})\right)\,\log\dfrac{N}{\mathrm{df}_i} & \text{if } \mathrm{tf}_{i,j} \ge 1 \\[4pt] 0 & \text{if } \mathrm{tf}_{i,j} = 0 \end{cases}$$
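A direct transcription of the weighting formula, using the tf/df counts defined earlier; the only caveat is that log bases and smoothing conventions vary across systems:

```python
import math

def weight(tf_ij, df_i, N):
    """(1 + log tf) * log(N / df); 0 when the term is absent (natural log here)."""
    if tf_ij == 0:
        return 0.0
    return (1.0 + math.log(tf_ij)) * math.log(N / df_i)

# A term occurring twice in a document and in 10 of 1000 documents:
print(weight(tf_ij=2, df_i=10, N=1000))  # ~7.80
```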
Term Distribution Models
- Develop a model for the distribution of a word and use it to characterize the word's importance for retrieval.
- Estimate $p_i(k)$: the proportion of documents in which word w_i appears exactly k times.
- Candidate models: the Poisson, two-Poisson, and K-mixture distributions.
- The IDF can be derived from term distribution models.
The Poisson Distribution
- We are interested in the frequency of occurrence of a particular word w_i in a document.

$$p_i(k; \lambda_i) = e^{-\lambda_i}\,\frac{\lambda_i^k}{k!} \quad \text{for some } \lambda_i > 0$$

- The parameter $\lambda_i > 0$ is the average number of occurrences of w_i per document: $\lambda_i = \frac{\mathrm{cf}_i}{N}$.
- The Poisson distribution is a good fit for non-content words.
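A sketch of fitting and evaluating the Poisson model for one word, with made-up collection statistics (cf_i = 100, N = 1000 are illustrative):

```python
import math

def poisson(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

cf_i, N = 100, 1000         # made-up collection statistics
lam = cf_i / N              # average occurrences per document
print(poisson(0, lam))      # predicted share of documents without the word
print(1 - poisson(0, lam))  # predicted df_i / N under the Poisson model
# Bursty content words occur in fewer documents than this predicts,
# which is why a single Poisson fits non-content words best.
```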
The Two-Poisson Model
- A better fit to the observed frequency distribution: a mixture of two Poissons.
- Non-privileged class: low average number of occurrences; occurrences are accidental.
- Privileged class: high average number of occurrences; the word is a central content word.

$$p_i(k; \pi, \lambda_1, \lambda_2) = \pi\, e^{-\lambda_1}\frac{\lambda_1^k}{k!} + (1 - \pi)\, e^{-\lambda_2}\frac{\lambda_2^k}{k!}$$

- $\pi$: probability of a document being in the privileged class
- $1 - \pi$: probability of a document being in the non-privileged class
- $\lambda_1, \lambda_2$: average number of occurrences of word w_i in each class
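Evaluating the mixture is straightforward once the parameters are known; estimating pi, lambda_1, lambda_2 from data typically requires an iterative fit (e.g., EM), which is omitted here. The parameter values below are made up for illustration:

```python
import math

def poisson(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def two_poisson(k, pi, lam1, lam2):
    """Mixture: privileged class with lam1, non-privileged class with lam2."""
    return pi * poisson(k, lam1) + (1 - pi) * poisson(k, lam2)

# Suppose 10% of documents are in the privileged class for this word:
for k in range(4):
    print(k, round(two_poisson(k, pi=0.1, lam1=2.0, lam2=0.01), 4))
```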
The K Mixture
- More accurate than the two-Poisson model.

$$p_i(k) = (1 - \alpha)\,\delta_{k,0} + \frac{\alpha}{\beta + 1}\left(\frac{\beta}{\beta + 1}\right)^{k}$$

where $\delta_{k,0} = 1$ iff $k = 0$. The parameters are estimated from collection statistics:

$$\lambda = \frac{\mathrm{cf}}{N}, \qquad \mathrm{IDF} = \log_2\frac{N}{\mathrm{df}}, \qquad \beta = \lambda\, 2^{\mathrm{IDF}} - 1 = \frac{\mathrm{cf} - \mathrm{df}}{\mathrm{df}}$$

- $\beta$: number of extra terms per document in which the term occurs
- $\mathrm{cf}$: absolute frequency of the term
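Since the parameters come directly from collection statistics, the K mixture needs no iterative fitting. The sketch below also uses alpha = lambda / beta, the standard estimate in this formulation, and assumes cf > df so that beta > 0; the cf, df, N values are made up:

```python
def k_mixture(k, cf, df, N):
    """K-mixture p_i(k) with parameters estimated from cf, df, N (cf > df)."""
    lam = cf / N
    beta = (cf - df) / df            # extra terms per occurrence-document
    alpha = lam / beta
    base = (1 - alpha) if k == 0 else 0.0
    return base + alpha / (beta + 1) * (beta / (beta + 1)) ** k

# A word with cf=100, df=40 in a 1000-document collection:
for k in range(3):
    print(k, round(k_mixture(k, cf=100, df=40, N=1000), 4))
# Sanity check: p_i(0) comes out as 0.96 = 1 - df/N, exactly the share
# of documents that do not contain the word.
```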
Latent Semantic Indexing
- Projects queries and documents into a space with "latent" semantic dimensions.
- Dimensionality reduction: the latent semantic space we project into has fewer dimensions than the original space.
- Exploits co-occurrence: the fact that two or more terms occur in the same documents more often than chance would predict.
- Similarity metric: co-occurring terms are projected onto the same dimensions.
Singular Value Decomposition
- SVD takes a term-by-document matrix A in an n-dimensional space and projects it to a matrix $\hat{A}$ in a lower-dimensional space of k dimensions (n >> k).
- The 2-norm (distance) between the two matrices is minimized:

$$\Delta = \| A - \hat{A} \|_2$$
SVD (Cont.)
SVD projection:

$$A_{t \times d} = T_{t \times n}\, S_{n \times n}\, (D_{d \times n})^T$$

- $A_{t \times d}$: term-by-document matrix
- $T_{t \times n}$: terms in the new space
- $S_{n \times n}$: singular values of A in descending order
- $D_{d \times n}$: documents in the new space
- $n = \min(t, d)$; T and D have orthonormal columns
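A sketch with numpy, whose `np.linalg.svd` returns exactly these factors; the matrix A is the term-by-document matrix from the LSI example below, and the rank-2 result should match (up to rounding) the k = 2 approximation printed there:

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)  # A = T diag(s) D^T
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]       # closest rank-k matrix in 2-norm
print(np.round(A_k, 2))
```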
LSI in IR
- Encode terms and documents using the factors derived from the SVD.
- Rank the similarity of terms and documents to the query via Euclidean distances or cosines.
LSI example: original vs. dimension-reduced

Original term-by-document matrix:

A =
1  0  1  0  0  0
0  1  0  0  0  0
1  1  0  0  0  0
1  0  0  1  1  0
0  0  0  1  0  1

Rank-2 approximation (k = 2):

 0.85  0.52  0.28  0.13  0.21 -0.08
 0.36  0.36  0.16 -0.21 -0.03 -0.18
 1.00  0.72  0.36 -0.05  0.16 -0.21
 0.98  0.13  0.21  1.03  0.62  0.41
 0.13 -0.39 -0.08  0.90  0.41  0.49

Rank-3 approximation (k = 3):

 1.05 -0.03  0.61 -0.02  0.29 -0.31
 0.15  0.92 -0.18 -0.05 -0.12  0.06
 0.87  1.07  0.15  0.04  0.10 -0.05
 1.03 -0.02  0.29  0.99  0.64  0.35
-0.02  0.01 -0.31  1.01  0.35  0.66
LSI example (cont.)
Condensed representation of the documents:

$$B = S_{2 \times 2}\, V_{2 \times n}$$

(the 2 × n matrix of document coordinates in the latent space)
LSI example: querying

Map a query into the latent space:

$$q' = q^T\, T_k\, S_k^{-1}$$

For example: q = 'astronaut car' = (0 1 0 1 0)

q' = (0.38 0.01)

Query result: cos(q', B_i) = (0.96 0.56 0.81 0.72 0.91 0.40)
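A numpy sketch of the whole pipeline: decompose A, condense the documents into B, fold in the query, and rank by cosine. The latent axes are only defined up to sign, but the cosines are sign-invariant, so they should agree with the slide's values up to rounding:

```python
import numpy as np

A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
B = np.diag(s[:k]) @ Dt[:k, :]               # condensed document vectors
q = np.array([0, 1, 0, 1, 0], dtype=float)   # 'astronaut car'
q_k = (q @ T[:, :k]) / s[:k]                 # q' = q^T T_k S_k^{-1}

cosines = (q_k @ B) / (np.linalg.norm(q_k) * np.linalg.norm(B, axis=0))
print(np.round(cosines, 2))                  # similarity of q to each document
```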
Discourse Segmentation
- Break documents into topically coherent multi-paragraph subparts.
- Detect topic shifts within a document.
TextTiling (Hearst and Plaunt, 1993)
- Searches for vocabulary shifts from one subtopic to another.
- Divides the text into fixed-size blocks (e.g., 20 words) and looks for topic shifts between these blocks, using three components (see the sketch after this list):
  - Cohesion scorer: measures the topic continuity at each gap (the point between two blocks).
  - Depth scorer: determines how low the cohesion score at a gap is compared to the surrounding gaps.
  - Boundary selector: looks at the depth scores and selects the gaps that are the best segmentation points.
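A skeletal cohesion scorer in this spirit: cosine similarity between the word counts of adjacent blocks, one score per gap. This is only a sketch; Hearst and Plaunt's algorithm smooths the scores and adds the depth scorer and boundary selector described above:

```python
import math
from collections import Counter

def cohesion_scores(tokens, block=20):
    """One cosine score per gap between adjacent fixed-size blocks."""
    blocks = [Counter(tokens[i:i + block])
              for i in range(0, len(tokens), block)]
    scores = []
    for left, right in zip(blocks, blocks[1:]):
        dot = sum(left[w] * right[w] for w in left)
        norm = (math.sqrt(sum(c * c for c in left.values())) *
                math.sqrt(sum(c * c for c in right.values())))
        scores.append(dot / norm if norm else 0.0)
    return scores  # dips in this sequence suggest topic shifts
```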