Automatically Building a Stopword List for an Information Retrieval System
University of Glasgow
Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
Outline
- Stopwords
- Investigation of two approaches
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion
What is a Stopword?
- A common word in a document, e.g. the, is, and, am, to, it
- Carries little information about the document's content
- Has low discrimination value in terms of IR: meaningless, contributes nothing to relevance
- Searching with stopwords will usually retrieve irrelevant documents
Objective
- Different collections contain different contents and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list
Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF, Normalised TF, IDF, Normalised IDF
2. New proposed approach: based on how informative a term is
Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old and outdated: lacks potentially new words
- Defined for general-purpose use, but different collections require different stopword lists
Zipf's Law
- Based on term frequencies, rank the terms accordingly: the term with the highest TF has rank 1, the next highest rank 2, and so on
- Zipf's Law states that a term's frequency is inversely proportional to its rank:

    F(r) = C / r

  where F(r) is the frequency of the term at rank r and C is a constant
(Slide: plot of term frequency against rank illustrating Zipf's Law)
Baseline Approach Algorithm
1. Generate a list of term frequencies from the corpus
2. Sort the frequencies in descending order
3. Rank the terms by frequency: the highest frequency gets rank 1, the next highest rank 2, etc.
4. Plot frequency against rank
Baseline Approach Algorithm (cont.)
5. Choose a threshold; any word appearing above the threshold is treated as a stopword
6. Run the queries with the resulting stopword list, removing all stopwords from the queries
7. Evaluate the system by Average Precision
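The baseline pipeline above can be sketched in a few lines (a minimal illustration, not the authors' implementation; the toy corpus and the threshold value are invented):

```python
from collections import Counter

def rank_by_frequency(corpus):
    """Count term frequencies over the corpus; rank 1 = highest frequency."""
    counts = Counter(term for doc in corpus for term in doc.lower().split())
    return counts.most_common()  # list of (term, frequency), rank 1 first

def stopwords_above_threshold(ranked, threshold):
    """Every term whose frequency exceeds the threshold is treated as a stopword."""
    return [term for term, freq in ranked if freq > threshold]

corpus = ["the cat sat on the mat", "the dog sat on the log"]
ranked = rank_by_frequency(corpus)
print(ranked[0])                             # -> ('the', 4): the rank-1 term
print(stopwords_above_threshold(ranked, 2))  # -> ['the']
```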
Baseline Approach - Variants
- Term Frequency (TF)
- Normalised Term Frequency:

    TF_Norm = log(TF / v)

  where v is a normalising factor
- Inverse Document Frequency:

    idf(k) = log(NDoc / D_k)

  where NDoc is the number of documents in the collection and D_k is the number of documents containing term k
- Normalised IDF:

    idf_Norm(k) = log((NDoc - D_k + 0.5) / (D_k + 0.5))
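As a concrete illustration of the two IDF variants (a sketch; the document counts are invented):

```python
import math

def idf(n_doc, d_k):
    """Plain inverse document frequency: log(NDoc / D_k)."""
    return math.log(n_doc / d_k)

def idf_norm(n_doc, d_k):
    """Normalised IDF: log((NDoc - D_k + 0.5) / (D_k + 0.5))."""
    return math.log((n_doc - d_k + 0.5) / (d_k + 0.5))

# A term appearing in nearly every document scores near zero under plain IDF
# but strongly negative under normalised IDF, which makes it easier to spot
# as a stopword candidate.
print(round(idf(1000, 990), 2))       # -> 0.01
print(round(idf_norm(1000, 990), 2))  # -> -4.55
```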
Baseline Approach – Choosing Threshold
- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the frequency difference between two consecutive ranks:

    F(r) - F(r+1)

- A big difference (i.e. a sudden jump) marks a natural cut-off, so it is important to choose an appropriate threshold
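One way to locate such a sudden jump is to take the largest drop F(r) - F(r+1) between consecutive ranks (a sketch; the frequency values are invented):

```python
def cut_at_largest_drop(frequencies):
    """frequencies: sorted in descending order, index 0 = rank 1.
    Returns the rank r after which the largest drop F(r) - F(r+1) occurs."""
    drops = [frequencies[i] - frequencies[i + 1]
             for i in range(len(frequencies) - 1)]
    return drops.index(max(drops)) + 1  # ranks are 1-based

freqs = [900, 850, 820, 200, 150, 140]  # sharp jump between ranks 3 and 4
print(cut_at_largest_drop(freqs))       # -> 3: terms ranked 1..3 become stopwords
```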
Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion
Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:

    w(t) = P_x * log2(P_x / P_c)

  where P_x = tf_x / l_x and P_c = F / token_c
  (tf_x is the frequency of t in the sampled document set, l_x the length in tokens of the sampled set, F the frequency of t in the whole collection, and token_c the total number of tokens in the collection)
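The weight above can be computed directly (a sketch using the quantities tf_x, l_x, F and token_c defined on the slide; the numbers are invented):

```python
import math

def kl_weight(tf_x, l_x, freq_c, token_c):
    """w(t) = P_x * log2(P_x / P_c), with P_x = tf_x / l_x and P_c = F / token_c."""
    p_x = tf_x / l_x
    p_c = freq_c / token_c
    return p_x * math.log2(p_x / p_c)

# A term no more frequent in the sample than in the whole collection gets
# weight 0 (uninformative -> stopword candidate); a term concentrated in the
# sample gets a large positive weight (informative).
print(kl_weight(10, 1000, 10_000, 1_000_000))            # -> 0.0
print(round(kl_weight(50, 1000, 10_000, 1_000_000), 4))  # -> 0.1161
```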
TBRSA Algorithm
1. Pick a term at random and retrieve the set of documents containing it
2. Weight every term in the retrieved set using the KL divergence measure
3. Normalise the weights by the maximum weight
4. Rank the terms by normalised weight in ascending order and keep the top X ranked terms (the least informative ones)
5. Repeat steps 1-4 Y times
6. Merge the Y lists, sort by weight, and extract the top L ranked terms as the stopword list
(Slide: worked example of normalised weight arrays being sorted and merged)
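Putting the steps together, the sampling loop might look like the sketch below (simplified: the toy corpus stands in for a real collection, X, Y and L are small illustrative values, and, as a detail the slides leave open, duplicate terms are merged by keeping their smallest normalised weight):

```python
import math
import random
from collections import Counter

def kl_weight(tf_x, l_x, freq_c, token_c):
    p_x, p_c = tf_x / l_x, freq_c / token_c
    return p_x * math.log2(p_x / p_c)

def tbrsa(corpus, x=3, y=5, l=3, seed=0):
    """corpus: list of token lists. Returns up to l candidate stopwords."""
    rng = random.Random(seed)
    coll_tf = Counter(t for doc in corpus for t in doc)
    token_c = sum(coll_tf.values())
    vocabulary = sorted(coll_tf)
    merged = {}
    for _ in range(y):                                  # repeat Y times
        pivot = rng.choice(vocabulary)                  # 1. random term
        sample = [d for d in corpus if pivot in d]      # 2. retrieve its documents
        sample_tf = Counter(t for doc in sample for t in doc)
        l_x = sum(sample_tf.values())
        weights = {t: kl_weight(tf, l_x, coll_tf[t], token_c)
                   for t, tf in sample_tf.items()}      # 3. KL weights
        w_max = max(weights.values())
        if w_max <= 0:                                  # sample covered the whole collection
            continue
        norm = {t: w / w_max for t, w in weights.items()}  # 4. normalise by max
        ranked = sorted(norm, key=norm.get)             # 5. rank ascending
        for t in ranked[:x]:                            # keep top X (least informative)
            merged[t] = min(merged.get(t, 1.0), norm[t])
    # 6. merge, sort ascending, extract the top L as stopwords
    return sorted(merged, key=merged.get)[:l]

docs = [s.split() for s in ["the cat sat on the mat",
                            "the dog ran in the park",
                            "the fish swam in the sea"]]
print(tbrsa(docs))
```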
Advantages / Disadvantages
Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress
Disadvantages:
- The first term is generated randomly, so a single sample could retrieve a small data set
- Experiments must be repeated Y times
Experimental Setup
- Four TREC collections: http://trec.nist.gov/data/docs_eng.html
- Each collection is indexed and stemmed with no pre-defined stopwords removed, so no stopwords are assumed at the outset
- Long queries were used (Title, Description and Narrative) to maximise our chances of exercising the new stopword lists
Experimental Platform
- Terrier (TERabyte RetrIEveR), IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework, which derives parameter-free probabilistic models
- The PL2 model was used
- http://ir.dcs.gla.ac.uk/terrier/
PL2 Model
- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:

    score(d, Q) = sum over t in Q of qtw * w(t, d)

  where qtw is the query term weight and

    w(t, d) = (1 / (tfn + 1)) * ( tfn * log2(tfn / lambda)
              + (lambda + 1/(12 * tfn) - tfn) * log2(e)
              + 0.5 * log2(2 * pi * tfn) )

- The normalised term frequency tfn is given by:

    tfn = tf * log2(1 + c * avg_l / l)    (c > 0)

  where l is the document length and avg_l is the average document length in the collection
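Under the formulas above, a PL2 weight can be computed directly (a sketch; following the DFR literature we take lambda = F/N, the term's collection frequency divided by the number of documents, a detail the slide does not spell out, and all statistics here are invented):

```python
import math

def pl2_weight(tf, doc_len, avg_len, coll_freq, n_docs, c=2.0):
    """w(t, d) for the PL2 model, with length-normalised term frequency tfn."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)  # tfn = tf * log2(1 + c*avg_l/l)
    lam = coll_freq / n_docs                         # Poisson mean lambda = F / N
    return (1 / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1 / (12 * tfn) - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# A rarer term (smaller collection frequency) earns a higher weight for the
# same within-document frequency.
print(pl2_weight(5, 100, 150, 50, 10_000) >
      pl2_weight(5, 100, 150, 5_000, 10_000))  # -> True
```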
Collections
disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs  | # Tokens | c value
disk45     | 2GB  | 528155  | 801397   | 2.13
WT2G       | 2GB  | 247491  | 1020277  | 2.75
WT10G      | 10GB | 1692096 | 3206346  | 2.43
DOTGOV     | 18GB | 1247753 | 2821821  | 2.00
Queries

Collection | Query Sets                   | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks | 100
WT2G       | TREC8                        | 50
WT10G      | TREC10                       | 50
DOTGOV     | TREC11 and TREC12 merged     | 100
Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline and the novel approach respectively
- Add the two lists together, removing duplicates
- The merged list might be stronger in terms of effectiveness
- Follows the classical IR technique of combining evidence
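Adding the two lists together and removing duplicates is a plain set union (the sample lists are illustrative, not the actual Fox or generated lists):

```python
def merge_stopword_lists(classical, generated):
    """Combine the two lists, removing duplicates; sorted for a stable order."""
    return sorted(set(classical) | set(generated))

fox_sample = ["the", "is", "and", "to"]
generated_sample = ["the", "html", "http", "copyright"]
print(merge_stopword_lists(fox_sample, generated_sample))
# -> ['and', 'copyright', 'html', 'http', 'is', 'the', 'to']
```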
Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results against Fox's classical stopword list, based on average precision
Baseline Approach – Overall Results
* indicates a significant difference at the 0.05 level
Normalised IDF gives the best result for every collection

Collection | Classical | TF     | Norm TF | IDF    | Norm IDF | p-value
disk45     | 0.2123    | 0.2130 | 0.2123  | 0.2113 | 0.2130   | 0.8845
WT2G       | 0.2569    | 0.2650 | 0.2676  | 0.2682 | 0.2700   | 0.001508*
WT10G      | 0.2000    | 0.2049 | 0.2076  | 0.2079 | 0.2079   | 0.1231
DOTGOV     | 0.1223    | 0.1212 | 0.1208  | 0.1227 | 0.1227   | 0.55255
Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document |           |
html      | year     |           |
TBRSA – Overall Results
* indicates a significant difference at the 0.05 level
disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     | 0.2123    | 0.2129        | 0.868
WT2G       | 0.2569    | 0.2668        | 0.07544
WT10G      | 0.2000    | 0.1900        | 0.4493
DOTGOV     | 0.1223    | 0.1180        | 0.002555*
TBRSA – Additional Terms Produced

disk45    | WT2G        | WT10G     | DOTGOV
column    | advance     | copyright | server
general   | beach       | friend    | modify
califonia | company     | memory    | length
industry  | environment | mountain  | content
month     | garden      | problem   | accept
director  | industry    | science   | inform
desk      | material    | special   | connect
economic  | pollution   | internet  | gov
business  | school      | document  | byte
Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective despite its age, so it is worth reusing
- The queries were quite "conservative"
Merging – Baseline Approach
* indicates a significant difference at the 0.05 level
Merging produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     | 0.2123    | 0.2130   | 0.2130 | 0.8845
WT2G       | 0.2569    | 0.2700   | 0.2712 | 0.00746*
WT10G      | 0.2000    | 0.2079   | 0.2109 | 0.03854*
DOTGOV     | 0.1223    | 0.1227   | 0.1241 | 0.6775
Merging – TBRSA
* indicates a significant difference at the 0.05 level
Merging produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     | 0.2123    | 0.2129        | 0.2129 | 0.868
WT2G       | 0.2569    | 0.2668        | 0.2703 | 0.008547*
WT10G      | 0.2000    | 0.1900        | 0.2066 | 0.4451
DOTGOV     | 0.1223    | 0.1180        | 0.1228 | 0.5085
Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list, and evaluated its effectiveness and robustness
- Compared it against 4 baseline variants based on Zipf's Law
- Merged the classical stopword list with the best generated list to produce a more effective stopword list
Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Handle verb vs noun ambiguity: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: some occurrences of a term may have to be kept while others are removed
Thank you! Any questions?
Thank you for your attention