
Information Retrieval in Text: Part I

• Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM 1999.

• Reading Assignment: Chapters 1 and 2.

Outline

• Introduction
• Basic Process of Information Retrieval
• Content Representation
– Document Purification and Analysis
– Item Normalization
– Index Construction
  • Manual Indexing
  • Automatic Indexing
    – Inverted File Structures
    – Signature Files

Introduction

• Expectations from our search engines
– Type principal, where one meant principle
– Type Lanzcos, where one meant Lanczos
– Type right and left, where one meant
  • Party associations
  • Traffic laws
  • Chaos
– Find what we want from a gigantic collection of documents (handle the tsunami of data)
• We are asking the computer to supply the information we want, rather than the information we asked for
– Reference librarians are already good at that, asking the patron a few questions before directing him to the results

Introduction

• An Information system consists of

– database of documents

– search engine

– interface

– search results

Basic Process Of IR

• The basic process of information retrieval can be described as:
– Representing the content of documents
  • Document Purification and Analysis
  • Item Normalization
  • Index Construction
– Representing the user’s information need
  • Query Representation
  • User Interface
– Ranking and Relevance Feedback
• The main objective of an IR system is to increase precision and recall, efficiently.

Precision and Recall

• Precision: the fraction of the documents retrieved by an algorithm that are correct (relevant)

• Recall: the fraction of the documents that should have been retrieved by an algorithm that were in fact retrieved

• Average Precision
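The slide does not define these measures formally. A minimal sketch, assuming set-based precision and recall and one common (uninterpolated) definition of average precision; the document ids and relevance judgments are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids and relevant ids."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Average the precision measured at each rank where a relevant document appears.

    Assumption: the sum is divided by the total number of relevant documents,
    so relevant documents that are never retrieved lower the score.
    """
    relevant_set = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set) if relevant_set else 0.0


ranked = [3, 7, 1, 9, 4]   # hypothetical ranked result list
relevant = {1, 3, 5}       # hypothetical relevance judgments
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
print(p, r, ap)
```

Here two of the five retrieved documents are relevant, and two of the three relevant documents were found.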

Document Purification and Analysis

• Unless documents are cleaned up (making sure every document has a title, a beginning and an end, and handling non-textual portions like images), wrong [portions of] documents may be retrieved


• Taking HTML documents, for example, one needs to decide which “tags” to index

• According to references published in 1997 and 1998, the following features are ignored in building a search engine index
– <COMMENT> tags
– <ALT TEXT> attribute
– <META> tags
– Image maps, frames, and some URLs


• Usually, search engines extract
– text, excluding punctuation, from title tags, header tags, and the first characters of an HTML file. This may include
  • The first 100 significant words
  • The first 20 lines per record
• Search engines would ignore
– Invisible text
– Text with smaller fonts
– Words containing numbers
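The extraction rules above can be sketched with Python's standard `html.parser`. This is only an illustration under assumptions: the class name `IndexableText`, the sample page, and the plain 100-word cut-off (with no test for "significance") are not from the slides, and invisible or small-font text is not detected.

```python
import re
from html.parser import HTMLParser

KEEP = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}  # title and header tags
SKIP = {"script", "style"}                            # never indexed


class IndexableText(HTMLParser):
    """Collect words from title/header tags separately from other text.

    Comments and tag attributes (META content, ALT text) never reach
    handle_data, so they are ignored automatically.
    """

    def __init__(self):
        super().__init__()
        self.keep_depth = 0
        self.skip_depth = 0
        self.priority = []  # words found in title and header tags
        self.body = []      # all other text

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.keep_depth += 1
        elif tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.keep_depth:
            self.keep_depth -= 1
        elif tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Punctuation is dropped by the regex; words containing digits are discarded.
        words = [w for w in re.findall(r"[a-z0-9]+", data.lower()) if w.isalpha()]
        (self.priority if self.keep_depth else self.body).extend(words)


page = """<html><head><title>Lanczos Methods</title>
<meta name="keywords" content="spam spam">
<!-- hidden comment --></head>
<body><h1>Sparse Matrices</h1>
<p>The mp3 file, and the rest of the text.</p>
<img src="a.png" alt="picture"></body></html>"""

parser = IndexableText()
parser.feed(page)
parser.close()
first_words = (parser.priority + parser.body)[:100]  # assumed 100-word cut-off
print(first_words)
```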


• Text formatting
– Use standard ASCII/Unicode
  • May need to convert certain formats to text or extract text information from them (e.g. PostScript, PDF)

• What about OCRed documents?

Item Normalization

• Words must be sliced and diced before being considered for index construction. This may include
– Identification of processing tokens (words)
– Characterization of tokens
– Stemming of tokens

Item Normalization

• Applying stop lists to the collections of processing tokens
– ftp://ftp.cs.cornell.edu/pub/smart/english.stop
– E.g. able, about, after, allow, became, been, before, certainly, clearly, enough, everywhere, etc.

– What to do with singletons (words appearing once in a collection of documents)?
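A minimal sketch of tokenizing, applying a stop list, and spotting singletons. The stop list here is a tiny hypothetical sample (the words from the slide plus a few common function words added for illustration), not the SMART list itself, and the two documents are invented.

```python
import re
from collections import Counter

# Illustrative stop list only; a real list is far longer.
STOP_WORDS = {
    "able", "about", "after", "allow", "became", "been", "before",
    "certainly", "clearly", "enough", "everywhere",
    "the", "a", "of", "is", "and",
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic processing tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(docs):
    """Return per-document token lists without stop words, plus the singletons."""
    token_lists = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]
    counts = Counter(t for tokens in token_lists for t in tokens)
    singletons = {t for t, n in counts.items() if n == 1}
    return token_lists, singletons


docs = [
    "The computer clearly has enough memory",
    "Memory is measured after the computer boots",
]
token_lists, singletons = normalize(docs)
print(token_lists)
print(sorted(singletons))
```

Whether to drop the singletons from the index is left open, as on the slide; the sketch only identifies them.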

Item Normalization

• Stemming: removing suffixes, and sometimes prefixes, to reduce a word to its root form.
– E.g. reformation, reformative, reformatory, reformed, and reformism can all be stemmed to
  • reform or form?
– This saves a considerable amount of space
– However, one may lose the context of the search
  • E.g. someone looking for reformation gets results that refer to reformatories (reform schools)

– Syntactic stemmers vs. dictionary-based stemmers
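A toy syntactic (rule-based) stemmer, given only as a sketch: the suffix list and the minimum stem length are assumptions chosen to cover the "reform" example, and this is not the Porter algorithm or any published stemmer.

```python
# Hypothetical suffix list, tried longest first.
SUFFIXES = sorted(
    ["atories", "ation", "ative", "atory", "ism", "ing", "ed", "s"],
    key=len,
    reverse=True,
)


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["reformation", "reformative", "reformatory", "reformed", "reformism"]
stems = [stem(w) for w in words]
print(stems)
print(stem("reformatories"))
```

All five words collapse to the same stem, which saves index space, but "reformatories" collapses to it as well, which is exactly the loss of context described above.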

Item Normalization

• Stemming Advantages
– Reduces diversity of word representations
  • Misspelled words are recognized
  • Handles plurals and common suffixes
– Increases recall
• Stemming Disadvantages
– Retrieval of irrelevant documents (reduces precision)
– Cannot be applied to proper nouns
• Currently available stemmers
– Al Stem: http://tides.umiacs.umd.edu/software.html
– http://www.nongnu.org/aramorph/javadoc/gpl/pierrick/brihaye/aramorph/lucene/ArabicStemmer.html
– Porter Stemmer: http://maya.cs.depaul.edu/~classes/ds575/porter.html
– http://webscripts.softpedia.com/scriptDownload/Porter-Stemmer-Download-42859.html

Index Construction

• Manual Indexing

• Automatic Indexing
– Inverted File Structure
– Signature Files
– Vector Space Models

Manual Indexing

• Every document is catalogued based on some individual’s or group’s assessment of what that document is about, and an appropriate list of descriptive entries is generated.
• Advantage
– Human indexers can establish relationships and concepts between seemingly different topics that can be very useful to future readers
  • Broader, narrower and related subjects

Manual Indexing

• Disadvantages
– Expensive
– Time consuming (think of manually indexing the Web)
– Can be subject to the background and personality of the indexer
  • Cleverdon reported that if two groups of people construct thesauri in a particular subject area, the overlap of index terms was about 60%
  • Moreover, if two indexers used the same thesaurus on the same document, the index terms they shared were about 30%.
– May not be reproducible in case of modification or loss of information

Manual Indexing

• Manual indexing has shifted its focus toward “the abstraction of concepts and judgments on the value of the information” (G. Kowalski, 1997)

Manual Indexing

• Yahoo! (up to 1999)
– Instead of using a web crawler, webmasters submit URLs for Yahoo! to pursue. If Yahoo! thinks a site is appropriate, it is included in the index, otherwise not.
  • Around 30% acceptance rate.
• What about sites fitting in more than one category?
• However, precision increases because the index is small

Manual Indexing

• EMBASE (Excerpta Medica DataBASE, Elsevier Science’s bibliographic database)
– Covers pharmacology and biomedicine
– Uses machine-aided indexing to work hand in hand with manual indexing
• National Library of Medicine
– Publishes MeSH (Medical Subject Headings)
– Uses indexers to assign as many headings as necessary to characterize accurately the content of a journal article.
• H. W. Wilson Company (similar to the MeSH approach)

Automatic Indexing

• Using algorithms/software to extract terms for indexing is the predominant method for processing documents from large repositories.

• Consists of huge computerized robots crawling throughout the Web all day and night, collecting documents and indexing every word in the text.

• Concepts may result from the index construction stage (as with vector space models), or may feed the index construction (as with inverted file structures and signature files), which is similar to manual indexing.

Inverted File Structure

• Consists of a document file, an inversion list and a dictionary.

• Document File
– Each document is given a unique identifier
– Processing tokens within the document are identified
• Dictionary
– A sorted list of all unique words or processing tokens in the system, each with a pointer to the location of its inversion list.
– May also include the frequency of each term in the collection (global frequency)
– N-grams and PAT trees are well-known data structures for processing dictionaries
• Inversion List
– Contains the pointers from a term to the documents that contain that term [and the position in that document].

DOCUMENTS
DOC#1: computer, bit, byte
DOC#2: memory, byte
DOC#3: computer, bit, memory
DOC#4: byte, computer

DICTIONARY
bit (2)
byte (3)
computer (3)
memory (2)

INVERSION LISTS
bit: 1, 3
byte: 1, 2, 4
computer: 1, 3, 4
memory: 2, 3

Figure 1: Inversion File Structure
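The structure in Figure 1 can be rebuilt in a few lines. A minimal sketch, assuming in-memory Python dictionaries stand in for the dictionary file and the inversion lists; the function names are illustrative.

```python
from collections import Counter, defaultdict

# The four documents of Figure 1, keyed by document identifier.
documents = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}


def build_inverted_file(docs):
    """Return (dictionary, inversion_lists) for a {doc_id: tokens} mapping."""
    frequency = Counter()                # global frequency of each term
    inversion_lists = defaultdict(list)  # term -> sorted list of doc ids
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            frequency[term] += 1
            postings = inversion_lists[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    dictionary = {term: frequency[term] for term in sorted(frequency)}
    return dictionary, dict(inversion_lists)


def query_and(inversion_lists, *terms):
    """Documents that contain every query term (intersection of inversion lists)."""
    lists = [set(inversion_lists.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []


dictionary, inversion_lists = build_inverted_file(documents)
print(dictionary)
print(inversion_lists)
print(query_and(inversion_lists, "computer", "bit"))
```

A conjunctive query only has to intersect the inversion lists of its terms; the documents themselves are never scanned.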

• Inversion lists may also include the position within the document
– May help in supporting queries of

• Phrases (consecutive keywords)

• Words within specified proximity
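A sketch of how stored positions support phrase and proximity queries. The three short documents and the helper names are hypothetical; `max_gap=1` means the second word immediately follows the first.

```python
def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a {doc_id: tokens} mapping."""
    index = {}
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def within(index, first, second, max_gap=1):
    """Doc ids where `second` occurs 1..max_gap positions after `first`."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = []
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        if any(0 < q - p <= max_gap
               for p in first_postings[doc_id]
               for q in second_postings[doc_id]):
            hits.append(doc_id)
    return hits


docs = {
    1: "the inverted file structure".split(),
    2: "file of inverted lists".split(),
    3: "inverted index and signature file".split(),
}
index = build_positional_index(docs)
phrase_hits = within(index, "inverted", "file")                # phrase query
near_hits = within(index, "inverted", "file", max_gap=4)       # proximity query
print(phrase_hits, near_hits)
```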

Pros

• For queries interested only in more recent information, only the latest databases need to be searched.

• Provides optimum performance.

• Concepts and their relationship can be stored.

Cons

• Space requirement for personal file system

• Needs exact spelling

Signature File

• A signature file search is a linear scan of a compressed version of the items, producing a response time that is linear with respect to file size.

• In Signature file indexing, each record is allocated a fixed-width signature, or bitstring, of w bits.

• Each word that appears in the record is hashed a number of times to determine the bits in the signature that should be set

Signature File

• Any record whose signature has a 1-bit corresponding to every 1-bit in the query signature is a potential answer

• Each such record must be fetched and checked directly against the query to determine whether it is a false match or a true match.
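The hashing, filtering, and false-match check can be sketched as follows. The width `W`, the number of hashes `K`, and the use of SHA-256 to derive bit positions are assumptions made for illustration; the records are the four documents of Figure 1.

```python
import hashlib

W = 64  # signature width in bits (assumed)
K = 3   # number of hash functions per word (assumed)


def word_bits(word, w=W, k=K):
    """Hash a word k times; each hash selects one bit of the w-bit signature."""
    mask = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % w)
    return mask


def signature(words):
    """Superimpose (OR together) the bit patterns of all words in a record."""
    sig = 0
    for word in words:
        sig |= word_bits(word)
    return sig


def search(records, signatures, query_words):
    """Return (candidates, matches) for a conjunctive query."""
    q = signature(query_words)
    # Linear scan: keep records whose signature covers every 1-bit of the query.
    candidates = [d for d, s in signatures.items() if (s & q) == q]
    # Fetch each candidate and check it directly to discard false matches.
    matches = [d for d in candidates if set(query_words) <= set(records[d])]
    return candidates, matches


records = {
    1: ["computer", "bit", "byte"],
    2: ["memory", "byte"],
    3: ["computer", "bit", "memory"],
    4: ["byte", "computer"],
}
signatures = {doc_id: signature(words) for doc_id, words in records.items()}
candidates, matches = search(records, signatures, ["computer", "bit"])
print(candidates, matches)
```

A record that really contains all query words always passes the signature test, so the filter never loses a true match; it can only admit extra candidates, which the direct check removes.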

• Many variants of signature file are available

Signature Files

Pros & Cons

• Pros

– Support ranked queries

• Cons

– A variety of parameters must be fixed in advance

– Expensive for disjunctive queries

– Response time is unpredictable

– Not scalable
