Page 1: Automatically Building a Stopword List for an Information Retrieval System

Automatically Building a Stopword List for an Information Retrieval System

University of Glasgow

Rachel Tsz-Wai Lo, Ben He, Iadh Ounis

Page 2: Automatically Building a Stopword List for an Information Retrieval System

Outline
- Stopwords
- Investigation of two approaches
  - Approach based on Zipf’s Law
  - New term-based random sampling approach
- Experimental Setup
- Results and Analysis
- Conclusion

Page 3: Automatically Building a Stopword List for an Information Retrieval System

What is a Stopword?
- Common words in a document
  - e.g. the, is, and, am, to, it
- Contains no information about documents
- Low discrimination value in terms of IR
  - meaningless, no contribution
- Search with stopwords will usually result in retrieving irrelevant documents

Page 4: Automatically Building a Stopword List for an Information Retrieval System

Objective
- Different collections contain different contents and word patterns
- Different collections may require a different set of stopwords
- Given a collection of documents
  - Investigate ways to automatically create a stopword list

Page 5: Automatically Building a Stopword List for an Information Retrieval System

Objective (cont)

1. Baseline Approach (benchmark)
   - 4 variants inspired by Zipf’s Law: TF, Normalised TF, IDF, Normalised IDF

2. How informative a term is (new proposed approach)

Page 6: Automatically Building a Stopword List for an Information Retrieval System

Fox’s Classical Stopword List and Its Weakness

- Contains 733 stopwords
- > 20 years old
  - Lacks potentially new words
- Defined for general purpose
  - different collections require different stopword lists
- Outdated

Page 7: Automatically Building a Stopword List for an Information Retrieval System

Zipf’s Law
- Based on the term frequencies of terms, rank these terms accordingly
  - term with highest TF will have rank = 1, next highest term with rank = 2 etc
- Zipf’s Law:

$$F(r) = \frac{C}{r}$$
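The ranking step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rank_frequency` and the toy sentence are assumptions made for the example, and F(r) is read as the frequency of the term at rank r, with C a constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) triples, rank 1 = most frequent term."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ordered, start=1)]

# Toy token stream (made up for illustration).
tokens = "the cat sat on the mat and the dog sat on the log".split()
table = rank_frequency(tokens)
for rank, term, freq in table[:3]:
    # On a large collection, rank * frequency is expected to stay roughly
    # constant under Zipf's Law; a toy sentence is far too small to show it.
    print(rank, term, freq, rank * freq)
```

On a real collection the product of rank and frequency would be inspected over many ranks to see how closely it follows a constant C.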

Page 8: Automatically Building a Stopword List for an Information Retrieval System

Zipf’s Law

Page 9: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach Algorithm

- Generate a list of frequencies vs terms based on corpus
- Sort the frequencies in descending order
- Rank the terms according to their frequencies. Highest frequencies would have rank=1 and next highest would have rank=2 etc.
- Draw a graph of frequencies vs rank

Page 10: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach Algorithm (cont.)

Page 11: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach Algorithm (cont.)

- Choose a threshold; any words that appear above the threshold are treated as stopwords
- Run the queries with the resulting stopword list; all stopwords in the queries will be removed
- Evaluate system with Average Precision
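A rough sketch of the threshold and query-filtering steps, assuming the plain term-frequency variant and reading "above the threshold" as "frequency strictly greater than the threshold". The function names, the toy corpus and the threshold value are hypothetical.

```python
from collections import Counter

def stopwords_above_threshold(tokens, threshold):
    """Terms whose collection frequency is strictly greater than the threshold."""
    counts = Counter(tokens)
    return {term for term, freq in counts.items() if freq > threshold}

def remove_stopwords(query, stopwords):
    """Lowercase the query, split on whitespace, and drop stopword terms."""
    return [term for term in query.lower().split() if term not in stopwords]

# Toy corpus and threshold, chosen only to make the example small.
corpus = "the cat sat on the mat and the dog sat on the log".split()
stopwords = stopwords_above_threshold(corpus, threshold=1)
filtered = remove_stopwords("The dog on the mat", stopwords)
print(sorted(stopwords), filtered)
```

Each candidate threshold yields a different stopword list, so in practice the retrieval run would be repeated once per list.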

Page 12: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach - Variants

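- Term Frequency (TF)
- Normalised Term Frequency

$$TF_{Norm} = -\log\left(\frac{TF}{v}\right)$$

- Inverse Document Frequency (IDF)

$$idf_k = \log\left(\frac{NDoc}{D_k}\right)$$

- Normalised IDF

$$idf_{k_{Norm}} = \log\left(\frac{(NDoc - D_k) + 0.5}{D_k + 0.5}\right)$$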
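The four variants can be computed side by side as in the sketch below. The slide does not define its symbols, so the reading here is an assumption: v is the total number of tokens in the collection, NDoc the number of documents, and D_k the number of documents containing term k. Natural logarithms and the sign of the normalised TF are also assumptions, and `baseline_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def baseline_scores(docs):
    """Compute the four baseline statistics for every term.

    docs is a list of token lists. Returns {term: {statistic name: value}}.
    """
    tf = Counter(tok for doc in docs for tok in doc)        # TF per term
    df = Counter(tok for doc in docs for tok in set(doc))   # D_k per term
    v = sum(tf.values())                                    # assumed: total tokens
    n_doc = len(docs)                                       # NDoc
    scores = {}
    for term in tf:
        scores[term] = {
            "tf": tf[term],
            "tf_norm": -math.log(tf[term] / v),
            "idf": math.log(n_doc / df[term]),
            "idf_norm": math.log((n_doc - df[term] + 0.5) / (df[term] + 0.5)),
        }
    return scores

# Toy collection of three tiny documents (made up for illustration).
docs = [["the", "cat"], ["the", "dog"], ["the", "bird", "sings"]]
scores = baseline_scores(docs)
print(scores["the"], scores["cat"])
```

Under this reading, a stopword candidate has a high TF but low values for the other three statistics, so the direction of the threshold depends on the variant.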

Page 13: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach – Choosing Threshold

- Produce best set of stopwords
  - > 50 stopword lists for each variant
- Investigate the frequency difference between two consecutive ranks, F(r+1) - F(r)
  - big difference (i.e. sudden jump)
- Important to choose appropriate threshold
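One way to look for a "sudden jump" is to find the rank with the largest drop between consecutive frequencies. The sketch below is only an illustration of that idea: the slides do not say a single largest jump was used to pick the threshold, and the function name and frequency values are made up.

```python
def largest_jump_rank(frequencies):
    """Return the 1-based rank r with the largest drop F(r) - F(r+1).

    frequencies must already be sorted in descending order.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to compare")
    drops = [frequencies[i] - frequencies[i + 1] for i in range(len(frequencies) - 1)]
    return drops.index(max(drops)) + 1

# Hypothetical frequencies for ranks 1..6.
freqs = [980, 950, 910, 300, 280, 10]
cut = largest_jump_rank(freqs)
candidate_stopword_ranks = list(range(1, cut + 1))
print(cut, candidate_stopword_ranks)
```

Because the frequencies are in descending order, F(r+1) - F(r) is negative; the code works with its magnitude.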

Page 14: Automatically Building a Stopword List for an Information Retrieval System

Term-Based Random Sampling Approach (TBRSA)

- Our proposed new approach
- Depends on how informative a term is
- Based on the Kullback-Leibler divergence measure
- Similar to the idea of query expansion

Page 15: Automatically Building a Stopword List for an Information Retrieval System

Kullback-Leibler Divergence Measure

- Used to measure the distance between two distributions.
- In our case, distribution of two terms, one of which is a random term
- The weight of a term t in the sampled document set is given by:

$$w(t) = P_x \cdot \log_2\left(\frac{P_x}{P_c}\right)$$

where

$$P_x = \frac{tf_x}{l_x} \quad \text{and} \quad P_c = \frac{F}{token_c}$$
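A small numeric sketch of the weight formula. The slide does not spell out its symbols, so the following reading is an assumption: tf_x is the frequency of the term in the sampled document set, l_x is the total length of the sampled set, F is the frequency of the term in the whole collection, and token_c is the total number of tokens in the collection. All numbers are invented.

```python
import math

def kl_weight(tf_x, l_x, F, token_c):
    """w(t) = P_x * log2(P_x / P_c) with P_x = tf_x / l_x and P_c = F / token_c."""
    if tf_x == 0:
        return 0.0  # limit of p * log2(p / q) as p -> 0
    p_x = tf_x / l_x
    p_c = F / token_c
    return p_x * math.log2(p_x / p_c)

# The term is exactly as common in the sample as in the collection,
# so it says nothing about the sample.
w_uninformative = kl_weight(tf_x=50, l_x=1000, F=5000, token_c=100000)
# The term is 8 times denser in the sample than in the collection.
w_informative = kl_weight(tf_x=40, l_x=1000, F=500, token_c=100000)
print(w_uninformative, w_informative)
```

Terms whose weight stays near zero behave the same inside and outside the sample, which is what makes them stopword candidates.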

Page 16: Automatically Building a Stopword List for an Information Retrieval System

TBRSA Algorithm

1. Random term
2. Retrieve
3. KL divergence measure
4. Normalise weights by max weight (e.g. 0.0 0.1 0.3 0.5 0.7)
5. Rank in ascending order and keep the top X ranked (e.g. 0.0 0.1 0.3)
6. Repeat Y times

Page 17: Automatically Building a Stopword List for an Information Retrieval System

TBRSA Algorithm (cont.)

- merge the top X ranked lists from all runs
- sort
- Extract top L ranked as stopwords
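The whole procedure can be sketched end to end as below. This is a simplified reading of the diagram labels, not the authors' implementation. Several choices are assumptions: "Retrieve" is taken to mean "collect every document containing the random term", duplicates are merged by averaging their weights, and a run whose sample is the entire collection is skipped because every weight is then zero. The function name `tbrsa` and the toy documents are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def tbrsa(docs, X, Y, L, seed=0):
    """Sketch of term-based random sampling. docs is a list of token lists."""
    rng = random.Random(seed)
    collection_tf = Counter(tok for doc in docs for tok in doc)
    token_c = sum(collection_tf.values())
    vocabulary = sorted(collection_tf)
    collected = defaultdict(list)  # term -> normalised weights across runs

    for _ in range(Y):
        seed_term = rng.choice(vocabulary)                    # random term
        sample = [doc for doc in docs if seed_term in doc]    # retrieve
        sample_tf = Counter(tok for doc in sample for tok in doc)
        l_x = sum(sample_tf.values())
        weights = {}
        for term, tf_x in sample_tf.items():                  # KL divergence weight
            p_x = tf_x / l_x
            p_c = collection_tf[term] / token_c
            weights[term] = p_x * math.log2(p_x / p_c)
        max_w = max(weights.values())
        if max_w <= 0:
            continue  # assumption: sample equals the collection, no signal
        normalised = {t: w / max_w for t, w in weights.items()}
        # Ascending order puts the least informative terms first.
        for term in sorted(normalised, key=lambda t: (normalised[t], t))[:X]:
            collected[term].append(normalised[term])

    # Merge duplicates by averaging, sort ascending, keep the top L.
    merged = {t: sum(ws) / len(ws) for t, ws in collected.items()}
    return sorted(merged, key=lambda t: (merged[t], t))[:L]

# Toy collection: "the" occurs in every document, the other words once each.
docs = [["the", "apple"], ["the", "bird"], ["the", "cloud"], ["the", "drum"]]
stopwords = tbrsa(docs, X=2, Y=20, L=1, seed=0)
print(stopwords)
```

The parameters X, Y and L are the ones named on the slides; the values used here are arbitrary.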

Page 18: Automatically Building a Stopword List for an Information Retrieval System

Advantages / Disadvantages

- Advantages
  - based on how informative a term is
  - computational effort minimal, compared to baselines
  - better coverage of collection
  - No need to monitor progress
- Disadvantages
  - Generates first term randomly, could retrieve a small data set
  - Repeat experiments Y times

Page 19: Automatically Building a Stopword List for an Information Retrieval System

Experimental Setup
- Four TREC collections
  - http://trec.nist.gov/data/docs_eng.html
- Each collection is indexed and stemmed with no pre-defined stopwords removed
  - No assumption of stopwords in the beginning
- Long queries were used
  - Title, Description and Narrative
  - Maximise our chances of using the new stopword lists

Page 20: Automatically Building a Stopword List for an Information Retrieval System

Experimental Platform

- Terrier - TERabyte RetrIEveR
  - IR Group, University of Glasgow
- Based on Divergence From Randomness (DFR) framework
  - Deriving parameter-free probabilistic models
  - PL2 model
- http://ir.dcs.gla.ac.uk/terrier/

Page 21: Automatically Building a Stopword List for an Information Retrieval System

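PL2 Model
- One of the DFR document weighting models
- Relevance score of a document d for query Q is:

$$score(d, Q) = \sum_{t \in Q} w(t, d)$$

where

$$w(t, d) = \frac{1}{tfn + 1}\left(tfn \cdot \log_2\frac{tfn}{\lambda} + \left(\lambda + \frac{1}{12 \cdot tfn} - tfn\right) \cdot \log_2 e + 0.5 \cdot \log_2(2\pi \cdot tfn)\right)$$

$$tfn = tf \cdot \log_2\left(1 + c \cdot \frac{avg\_l}{l}\right), \quad (c > 0)$$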
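The PL2 score can be sketched in a few lines. This follows the standard form of the PL2 formula and is not taken from Terrier's source code. The parameter `lam` stands for the Poisson mean of the term, which in the DFR framework is usually the term's collection frequency divided by the number of documents; the slide does not define it, so treat that as an assumption. Function names and input values are hypothetical.

```python
import math

def pl2_term_weight(tf, doc_len, avg_len, lam, c):
    """PL2 weight of one term in one document.

    lam is the Poisson mean for the term; c > 0 is the
    length-normalisation parameter.
    """
    if tf == 0:
        return 0.0  # a term absent from the document contributes nothing
    tfn = tf * math.log2(1 + c * avg_len / doc_len)
    return (1.0 / (tfn + 1)) * (
        tfn * math.log2(tfn / lam)
        + (lam + 1.0 / (12 * tfn) - tfn) * math.log2(math.e)
        + 0.5 * math.log2(2 * math.pi * tfn)
    )

def pl2_score(query_terms, doc_tf, doc_len, avg_len, lam_by_term, c):
    """score(d, Q): sum of the term weights over the query terms."""
    return sum(
        pl2_term_weight(doc_tf.get(t, 0), doc_len, avg_len, lam_by_term[t], c)
        for t in query_terms
    )

# Same in-document frequency, but one term is rare in the collection
# (small lam) and the other is common (large lam).
rare = pl2_term_weight(tf=3, doc_len=100, avg_len=100, lam=0.01, c=1.0)
common = pl2_term_weight(tf=3, doc_len=100, avg_len=100, lam=2.0, c=1.0)
print(rare, common)
```

A frequent term contributes much less to the score than a rare one at the same in-document frequency, which is why removing stopwords from long queries changes the ranking only modestly.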

Page 22: Automatically Building a Stopword List for an Information Retrieval System

Collections

disk45, WT2G, WT10G and DOTGOV

Collection  Size   # Docs    # Tokens  c value
disk45      2GB    528155    801397    2.13
WT2G        2GB    247491    1020277   2.75
WT10G       10GB   1692096   3206346   2.43
DOTGOV      18GB   1247753   2821821   2.00

Page 23: Automatically Building a Stopword List for an Information Retrieval System

Queries

Collection  Query Sets                       # Queries
disk45      TREC7 and TREC8 of ad-hoc tasks  100
WT2G        TREC8                            50
WT10G       TREC10                           50
DOTGOV      TREC11 and TREC12 merged         100

Page 24: Automatically Building a Stopword List for an Information Retrieval System

Merging Stopword Lists
- Merging classical with best generated using baseline and novel approach respectively
  - Adding 2 lists together, removing duplicates
- Might be stronger in terms of effectiveness
- Follows from classical IR technique of combining evidence
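Merging two lists and removing duplicates is a one-liner in most languages. The sketch below keeps the order of first appearance. The two short lists are toy stand-ins, not Fox's actual list or a list generated in the experiments.

```python
def merge_stopword_lists(classical, generated):
    """Concatenate two stopword lists, dropping duplicates but keeping order."""
    # dict.fromkeys keeps the first occurrence of every key in insertion order.
    return list(dict.fromkeys(list(classical) + list(generated)))

# Toy lists for illustration only.
classical = ["the", "is", "and", "to"]
generated = ["html", "the", "http", "and"]
merged = merge_stopword_lists(classical, generated)
print(merged)
```

A plain set union would work equally well when the order of the merged list does not matter.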

Page 25: Automatically Building a Stopword List for an Information Retrieval System

Results and Analysis
- Produce as many sets of stopwords as possible (by choosing different thresholds for baseline approach)
- Compare results obtained to Fox’s classical stopword list, based on average precision
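For reference, average precision for one query and its mean over a query set can be computed as in the sketch below. It assumes the figures reported per collection are the mean of per-query average precision (non-interpolated), which the slides do not state explicitly. Function names, document ids and relevance judgements are invented.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked result list against the relevant ids."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / position  # precision at this relevant hit
    return precision_sum / len(relevant_ids)

def mean_average_precision(runs):
    """runs is a list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# One toy query: relevant documents d1 and d2 are found at positions 2 and 4.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"})
print(ap)
```

Comparing two stopword lists then amounts to running the same queries with each list and comparing the resulting means.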

Page 26: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach – Overall Results

- * indicates significant difference at 0.05 level
- Normalised IDF gives the highest average precision for every collection

Collection  Classical  TF      Norm TF  IDF     Norm IDF  p-value
disk45      0.2123     0.2130  0.2123   0.2113  0.2130    0.8845
WT2G        0.2569     0.2650  0.2676   0.2682  0.2700    0.001508*
WT10G       0.2000     0.2049  0.2076   0.2079  0.2079    0.1231
DOTGOV      0.1223     0.1212  0.1208   0.1227  0.1227    0.55255

Page 27: Automatically Building a Stopword List for an Information Retrieval System

Baseline Approach – Additional Terms Produced

disk45     WT2G      WT10G      DOTGOV
financial  html      able       content
company    http      copyright  gov
president  htm       ok         define
people     internet  http       year
market     web       html       administrate
london     today     january    http
national   policy    history    web
structure  content   facil      economic

january document

html year

Page 28: Automatically Building a Stopword List for an Information Retrieval System

TBRSA – Overall Results

- * indicates significant difference at 0.05 level
- disk45 and WT2G both show improvements

Collection  Classical  Best Obtained  p-value
disk45      0.2123     0.2129         0.868
WT2G        0.2569     0.2668         0.07544
WT10G       0.2000     0.1900         0.4493
DOTGOV      0.1223     0.1180         0.002555*

Page 29: Automatically Building a Stopword List for an Information Retrieval System

TBRSA – Additional Terms Produced

disk45 WT2G WT10G DOTGOV

column advance copyright

server

general beach friend modify

califonia

company memory length

industry environment

mountain

content

month garden problem accept

director industry science inform

desk material special connect

economic

pollution internet gov

business

school document

byte

Page 30: Automatically Building a Stopword List for an Information Retrieval System

Refinement - Merging
- New approach (TBRSA) gives comparable results
  - Computation effort is less
- Fox’s classical stopword list was very effective, despite its old age
  - Worth using
- Queries were quite “conservative”

Page 31: Automatically Building a Stopword List for an Information Retrieval System

Merging – Baseline Approach

- * indicates significant difference at 0.05 level
- Produced a more effective stopword list

Collection  Classical  Norm IDF  Merged  p-value
disk45      0.2123     0.2130    0.2130  0.8845
WT2G        0.2569     0.2700    0.2712  0.00746*
WT10G       0.2000     0.2079    0.2109  0.03854*
DOTGOV      0.1223     0.1227    0.1241  0.6775

Page 32: Automatically Building a Stopword List for an Information Retrieval System

Merging – TBRSA

- * indicates significant difference at 0.05 level
- Produced an improved stopword list with less computational effort

Collection  Classical  Best Obtained  Merged  p-value
disk45      0.2123     0.2129         0.2129  0.868
WT2G        0.2569     0.2668         0.2703  0.008547*
WT10G       0.2000     0.1900         0.2066  0.4451
DOTGOV      0.1223     0.1180         0.1228  0.5085

Page 33: Automatically Building a Stopword List for an Information Retrieval System

Conclusion & Future Work

- Proposed a novel approach for automatically generating a stopword list
  - Effectiveness and robustness
- Compared to 4 baseline variants, based on Zipf’s Law
- Merge classical stopword list with best found result to produce a more effective stopword list

Page 34: Automatically Building a Stopword List for an Information Retrieval System

Conclusion & Future Work (cont.)

- Investigate other divergence metrics
  - Poisson-based approach
- Verb vs Noun
  - “I can open a can of tuna with a can opener”
  - “to be or not to be”
- Detect nature of context
  - Might have to keep some of the terms but remove others

Page 35: Automatically Building a Stopword List for an Information Retrieval System

Thank you! Any questions?

Thank you for your attention