Automatically Building a Stopword List for an Information Retrieval System
University of Glasgow
Rachel Tsz-Wai Lo, Ben He, Iadh Ounis
Outline
- Stopwords
- Investigation of two approaches
  - Approach based on Zipf's Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion
What is a Stopword?
- A common word in a document, e.g. the, is, and, am, to, it
- Carries little information about the document's content
- Has low discrimination value in terms of IR: meaningless, contributes nothing to relevance
- Searching with stopwords will usually retrieve irrelevant documents
Objective
- Different collections contain different contents and word patterns
- Different collections may therefore require different sets of stopwords
- Given a collection of documents, investigate ways to automatically create a stopword list
Objective (cont.)
1. Baseline approach (benchmark): 4 variants inspired by Zipf's Law
   - TF, Normalised TF, IDF, Normalised IDF
2. New proposed approach: based on how informative a term is
Fox's Classical Stopword List and Its Weaknesses
- Contains 733 stopwords
- More than 20 years old and outdated: lacks potentially new words
- Defined for general-purpose use, but different collections require different stopword lists
Zipf's Law
- Based on term frequencies, rank the terms accordingly: the term with the highest TF has rank 1, the next highest rank 2, and so on
- Zipf's Law states that a term's frequency is inversely proportional to its rank:

    F(r) = C / r

  where F(r) is the frequency of the term at rank r and C is a constant
(Slide: plot of term frequency against rank illustrating Zipf's Law)
Baseline Approach Algorithm
1. Generate a list of term frequencies from the corpus
2. Sort the frequencies in descending order
3. Rank the terms by frequency: the highest frequency gets rank 1, the next highest rank 2, etc.
4. Plot frequency against rank
Baseline Approach Algorithm (cont.)
5. Choose a threshold; any word appearing above the threshold is treated as a stopword
6. Run the queries with the resulting stopword list, removing all stopwords from the queries
7. Evaluate the system by Average Precision
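The baseline pipeline above can be sketched in a few lines (a minimal illustration, not the authors' implementation; the toy corpus and the threshold value are invented):

```python
from collections import Counter

def rank_by_frequency(corpus):
    """Count term frequencies over the corpus; rank 1 = highest frequency."""
    counts = Counter(term for doc in corpus for term in doc.lower().split())
    return counts.most_common()  # list of (term, frequency), rank 1 first

def stopwords_above_threshold(ranked, threshold):
    """Every term whose frequency exceeds the threshold is treated as a stopword."""
    return [term for term, freq in ranked if freq > threshold]

corpus = ["the cat sat on the mat", "the dog sat on the log"]
ranked = rank_by_frequency(corpus)
print(ranked[0])                             # -> ('the', 4): the rank-1 term
print(stopwords_above_threshold(ranked, 2))  # -> ['the']
```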
Baseline Approach - Variants
- Term Frequency (TF)
- Normalised Term Frequency:

    TF_Norm = log(TF / v)

  where v is a normalising factor
- Inverse Document Frequency:

    idf(k) = log(NDoc / D_k)

  where NDoc is the number of documents in the collection and D_k is the number of documents containing term k
- Normalised IDF:

    idf_Norm(k) = log((NDoc - D_k + 0.5) / (D_k + 0.5))
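As a concrete illustration of the two IDF variants (a sketch; the document counts are invented):

```python
import math

def idf(n_doc, d_k):
    """Plain inverse document frequency: log(NDoc / D_k)."""
    return math.log(n_doc / d_k)

def idf_norm(n_doc, d_k):
    """Normalised IDF: log((NDoc - D_k + 0.5) / (D_k + 0.5))."""
    return math.log((n_doc - d_k + 0.5) / (d_k + 0.5))

# A term appearing in nearly every document scores near zero under plain IDF
# but strongly negative under normalised IDF, which makes it easier to spot
# as a stopword candidate.
print(round(idf(1000, 990), 2))       # -> 0.01
print(round(idf_norm(1000, 990), 2))  # -> -4.55
```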
Baseline Approach – Choosing Threshold
- To produce the best set of stopwords, more than 50 stopword lists were generated for each variant
- Investigate the frequency difference between two consecutive ranks:

    F(r) - F(r+1)

- A big difference (i.e. a sudden jump) marks a natural cut-off, so it is important to choose an appropriate threshold
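One way to locate such a sudden jump is to take the largest drop F(r) - F(r+1) between consecutive ranks (a sketch; the frequency values are invented):

```python
def cut_at_largest_drop(frequencies):
    """frequencies: sorted in descending order, index 0 = rank 1.
    Returns the rank r after which the largest drop F(r) - F(r+1) occurs."""
    drops = [frequencies[i] - frequencies[i + 1]
             for i in range(len(frequencies) - 1)]
    return drops.index(max(drops)) + 1  # ranks are 1-based

freqs = [900, 850, 820, 200, 150, 140]  # sharp jump between ranks 3 and 4
print(cut_at_largest_drop(freqs))       # -> 3: terms ranked 1..3 become stopwords
```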
Term-Based Random Sampling Approach (TBRSA)
- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar in spirit to query expansion
Kullback-Leibler Divergence Measure
- Used to measure the distance between two distributions; in our case, the distributions of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:

    w(t) = P_x * log2(P_x / P_c)

  where P_x = tf_x / l_x and P_c = F / token_c
  (tf_x is the frequency of t in the sampled document set, l_x the length in tokens of the sampled set, F the frequency of t in the whole collection, and token_c the total number of tokens in the collection)
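The weight above can be computed directly (a sketch using the quantities tf_x, l_x, F and token_c defined on the slide; the numbers are invented):

```python
import math

def kl_weight(tf_x, l_x, freq_c, token_c):
    """w(t) = P_x * log2(P_x / P_c), with P_x = tf_x / l_x and P_c = F / token_c."""
    p_x = tf_x / l_x
    p_c = freq_c / token_c
    return p_x * math.log2(p_x / p_c)

# A term no more frequent in the sample than in the whole collection gets
# weight 0 (uninformative -> stopword candidate); a term concentrated in the
# sample gets a large positive weight (informative).
print(kl_weight(10, 1000, 10_000, 1_000_000))            # -> 0.0
print(round(kl_weight(50, 1000, 10_000, 1_000_000), 4))  # -> 0.1161
```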
TBRSA Algorithm
1. Pick a term at random and retrieve the set of documents containing it
2. Weight every term in the retrieved set using the KL divergence measure
3. Normalise the weights by the maximum weight
4. Rank the terms by normalised weight in ascending order and keep the top X ranked terms (the least informative ones)
5. Repeat steps 1-4 Y times
6. Merge the Y lists, sort by weight, and extract the top L ranked terms as the stopword list
(Slide: worked example of normalised weight arrays being sorted and merged)
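Putting the steps together, the sampling loop might look like the sketch below (simplified: the toy corpus stands in for a real collection, X, Y and L are small illustrative values, and, as a detail the slides leave open, duplicate terms are merged by keeping their smallest normalised weight):

```python
import math
import random
from collections import Counter

def kl_weight(tf_x, l_x, freq_c, token_c):
    p_x, p_c = tf_x / l_x, freq_c / token_c
    return p_x * math.log2(p_x / p_c)

def tbrsa(corpus, x=3, y=5, l=3, seed=0):
    """corpus: list of token lists. Returns up to l candidate stopwords."""
    rng = random.Random(seed)
    coll_tf = Counter(t for doc in corpus for t in doc)
    token_c = sum(coll_tf.values())
    vocabulary = sorted(coll_tf)
    merged = {}
    for _ in range(y):                                  # repeat Y times
        pivot = rng.choice(vocabulary)                  # 1. random term
        sample = [d for d in corpus if pivot in d]      # 2. retrieve its documents
        sample_tf = Counter(t for doc in sample for t in doc)
        l_x = sum(sample_tf.values())
        weights = {t: kl_weight(tf, l_x, coll_tf[t], token_c)
                   for t, tf in sample_tf.items()}      # 3. KL weights
        w_max = max(weights.values())
        if w_max <= 0:                                  # sample covered the whole collection
            continue
        norm = {t: w / w_max for t, w in weights.items()}  # 4. normalise by max
        ranked = sorted(norm, key=norm.get)             # 5. rank ascending
        for t in ranked[:x]:                            # keep top X (least informative)
            merged[t] = min(merged.get(t, 1.0), norm[t])
    # 6. merge, sort ascending, extract the top L as stopwords
    return sorted(merged, key=merged.get)[:l]

docs = [s.split() for s in ["the cat sat on the mat",
                            "the dog ran in the park",
                            "the fish swam in the sea"]]
print(tbrsa(docs))
```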
Advantages / Disadvantages
Advantages:
- Based on how informative a term is
- Minimal computational effort compared to the baselines
- Better coverage of the collection
- No need to monitor progress
Disadvantages:
- The first term is generated randomly, so a single sample could retrieve a small data set
- Experiments must be repeated Y times
Experimental Setup
- Four TREC collections: http://trec.nist.gov/data/docs_eng.html
- Each collection is indexed and stemmed with no pre-defined stopwords removed, so no stopwords are assumed at the outset
- Long queries were used (Title, Description and Narrative) to maximise our chances of exercising the new stopword lists
Experimental Platform
- Terrier (TERabyte RetrIEveR), IR Group, University of Glasgow
- Based on the Divergence From Randomness (DFR) framework, which derives parameter-free probabilistic models
- The PL2 model was used
- http://ir.dcs.gla.ac.uk/terrier/
PL2 Model
- One of the DFR document weighting models
- The relevance score of a document d for a query Q is:

    score(d, Q) = sum over t in Q of qtw * w(t, d)

  where qtw is the query term weight and

    w(t, d) = (1 / (tfn + 1)) * ( tfn * log2(tfn / lambda)
              + (lambda + 1/(12 * tfn) - tfn) * log2(e)
              + 0.5 * log2(2 * pi * tfn) )

- The normalised term frequency tfn is given by:

    tfn = tf * log2(1 + c * avg_l / l)    (c > 0)

  where l is the document length and avg_l is the average document length in the collection
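Under the formulas above, a PL2 weight can be computed directly (a sketch; following the DFR literature we take lambda = F/N, the term's collection frequency divided by the number of documents, a detail the slide does not spell out, and all statistics here are invented):

```python
import math

def pl2_weight(tf, doc_len, avg_len, coll_freq, n_docs, c=2.0):
    """w(t, d) for the PL2 model, with length-normalised term frequency tfn."""
    tfn = tf * math.log2(1 + c * avg_len / doc_len)  # tfn = tf * log2(1 + c*avg_l/l)
    lam = coll_freq / n_docs                         # Poisson mean lambda = F / N
    return (1 / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1 / (12 * tfn) - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

# A rarer term (smaller collection frequency) earns a higher weight for the
# same within-document frequency.
print(pl2_weight(5, 100, 150, 50, 10_000) >
      pl2_weight(5, 100, 150, 5_000, 10_000))  # -> True
```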
Collections
disk45, WT2G, WT10G and DOTGOV

Collection | Size | # Docs  | # Tokens | c value
disk45     | 2GB  | 528155  | 801397   | 2.13
WT2G       | 2GB  | 247491  | 1020277  | 2.75
WT10G      | 10GB | 1692096 | 3206346  | 2.43
DOTGOV     | 18GB | 1247753 | 2821821  | 2.00
Queries

Collection | Query Sets                   | # Queries
disk45     | TREC7 and TREC8 ad-hoc tasks | 100
WT2G       | TREC8                        | 50
WT10G      | TREC10                       | 50
DOTGOV     | TREC11 and TREC12 merged     | 100
Merging Stopword Lists
- Merge the classical list with the best list generated by the baseline and the novel approach respectively
- Add the two lists together, removing duplicates
- The merged list might be stronger in terms of effectiveness
- Follows the classical IR technique of combining evidence
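Adding the two lists together and removing duplicates is a plain set union (the sample lists are illustrative, not the actual Fox or generated lists):

```python
def merge_stopword_lists(classical, generated):
    """Combine the two lists, removing duplicates; sorted for a stable order."""
    return sorted(set(classical) | set(generated))

fox_sample = ["the", "is", "and", "to"]
generated_sample = ["the", "html", "http", "copyright"]
print(merge_stopword_lists(fox_sample, generated_sample))
# -> ['and', 'copyright', 'html', 'http', 'is', 'the', 'to']
```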
Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for the baseline approach)
- Compare the results against Fox's classical stopword list, based on average precision
Baseline Approach – Overall Results
* indicates a significant difference at the 0.05 level
Normalised IDF gives the best result for every collection

Collection | Classical | TF     | Norm TF | IDF    | Norm IDF | p-value
disk45     | 0.2123    | 0.2130 | 0.2123  | 0.2113 | 0.2130   | 0.8845
WT2G       | 0.2569    | 0.2650 | 0.2676  | 0.2682 | 0.2700   | 0.001508*
WT10G      | 0.2000    | 0.2049 | 0.2076  | 0.2079 | 0.2079   | 0.1231
DOTGOV     | 0.1223    | 0.1212 | 0.1208  | 0.1227 | 0.1227   | 0.55255
Baseline Approach – Additional Terms Produced

disk45    | WT2G     | WT10G     | DOTGOV
financial | html     | able      | content
company   | http     | copyright | gov
president | htm      | ok        | define
people    | internet | http      | year
market    | web      | html      | administrate
london    | today    | january   | http
national  | policy   | history   | web
structure | content  | facil     | economic
january   | document |           |
html      | year     |           |
TBRSA – Overall Results
* indicates a significant difference at the 0.05 level
disk45 and WT2G both show improvements

Collection | Classical | Best Obtained | p-value
disk45     | 0.2123    | 0.2129        | 0.868
WT2G       | 0.2569    | 0.2668        | 0.07544
WT10G      | 0.2000    | 0.1900        | 0.4493
DOTGOV     | 0.1223    | 0.1180        | 0.002555*
TBRSA – Additional Terms Produced

disk45    | WT2G        | WT10G     | DOTGOV
column    | advance     | copyright | server
general   | beach       | friend    | modify
califonia | company     | memory    | length
industry  | environment | mountain  | content
month     | garden      | problem   | accept
director  | industry    | science   | inform
desk      | material    | special   | connect
economic  | pollution   | internet  | gov
business  | school      | document  | byte
Refinement - Merging
- The new approach (TBRSA) gives comparable results with less computational effort
- Fox's classical stopword list was very effective despite its age, so it is worth reusing
- The queries were quite "conservative"
Merging – Baseline Approach
* indicates a significant difference at the 0.05 level
Merging produced a more effective stopword list

Collection | Classical | Norm IDF | Merged | p-value
disk45     | 0.2123    | 0.2130   | 0.2130 | 0.8845
WT2G       | 0.2569    | 0.2700   | 0.2712 | 0.00746*
WT10G      | 0.2000    | 0.2079   | 0.2109 | 0.03854*
DOTGOV     | 0.1223    | 0.1227   | 0.1241 | 0.6775
Merging – TBRSA
* indicates a significant difference at the 0.05 level
Merging produced an improved stopword list with less computational effort

Collection | Classical | Best Obtained | Merged | p-value
disk45     | 0.2123    | 0.2129        | 0.2129 | 0.868
WT2G       | 0.2569    | 0.2668        | 0.2703 | 0.008547*
WT10G      | 0.2000    | 0.1900        | 0.2066 | 0.4451
DOTGOV     | 0.1223    | 0.1180        | 0.1228 | 0.5085
Conclusion & Future Work
- Proposed a novel approach for automatically generating a stopword list, and evaluated its effectiveness and robustness
- Compared it against 4 baseline variants based on Zipf's Law
- Merged the classical stopword list with the best generated list to produce a more effective stopword list
Conclusion & Future Work (cont.)
- Investigate other divergence metrics, e.g. a Poisson-based approach
- Handle verb vs noun ambiguity: "I can open a can of tuna with a can opener", "to be or not to be"
- Detect the nature of the context: some occurrences of a term may have to be kept while others are removed
Thank you! Any questions?
Thank you for your attention