Intelligent Information Retrieval
CS 336 – Lecture 3: Text Operations

Xiaoyan Li

Spring 2006

Topics

• Five-step document preprocessing
• Porter stemming algorithm
• Text compression

Five-Step Document Preprocessing

• Lexical analysis of the text
– How to treat digits, hyphens, punctuation marks, and the case of letters
• Elimination of stopwords
– Words with low discrimination value
• Stemming
– Removing prefixes and suffixes
• Selection of index terms
– Determine which words/stems will be used as indexing elements
• Construction of term categorization structures
– For example, a thesaurus

Step 1: Lexical analysis of the text

• Converting the text of a document (a large string or a stream of characters) into a stream of words
– Word separators (English, Chinese)

• How to deal with digits, punctuation marks, hyphens, and the case of letters
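A minimal sketch of the lexical-analysis step for space-delimited text such as English. The function name `tokenize` and its two flags are assumptions made for illustration; they simply expose the policy choices listed above (case, digits, hyphens, punctuation). Languages without explicit word separators, such as Chinese, need a segmentation step that this sketch does not cover.

```python
import re

def tokenize(text, keep_digits=False, split_hyphens=True):
    """Turn a raw string into a list of lowercase word tokens."""
    text = text.lower()  # fold case so "Text" and "text" match
    chars = "a-z0-9" if keep_digits else "a-z"
    if split_hyphens:
        # Hyphens and punctuation both act as separators.
        pattern = f"[{chars}]+"
    else:
        # Keep internal hyphens, e.g. "state-of-the-art" stays one token.
        pattern = f"[{chars}]+(?:-[{chars}]+)*"
    return re.findall(pattern, text)

sample = "State-of-the-art IR, since 1999!"
print(tokenize(sample))
print(tokenize(sample, keep_digits=True))
print(tokenize(sample, split_hyphens=False))
```

Each flag changes which index terms a document produces, so the same choices have to be applied to queries as well.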

Step 2: Elimination of stopwords

• Frequent words in the collection
• Not good discriminators
– Filtered out as potential index terms
• Elimination of stopwords reduces the size of the indexing structure considerably
– 40% or more
• Examples
– Articles, prepositions, conjunctions, etc.
– Even some verbs, adverbs, and adjectives
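Stopword elimination can be sketched as a set-membership filter. The tiny `STOPWORDS` set below is an illustrative assumption, not a standard list; real systems use lists with hundreds of entries.

```python
# Illustrative stopword list (an assumption, far smaller than real lists).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = "the size of the index is reduced".split()
print(remove_stopwords(tokens))
```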

Step 3: Stemming

• Problem with perfect match:
– One query word “connect” and its multiple variants “connected”, “connecting”, “connects” in different documents

• Stemming: Reduce variants of the same root word to a common concept

• Stemming also reduces the number of distinct index terms

• The Porter Algorithm
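The perfect-match problem can be illustrated with a toy collection. `crude_stem` is a hypothetical stand-in that strips three suffixes; it is not the Porter algorithm, and the documents are invented for the example.

```python
def crude_stem(word):
    """Toy stemmer: strip one of a few suffixes if at least 3 letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def search(query, docs, normalize=lambda w: w):
    """Return ids of documents containing a word equal to the query after normalization."""
    q = normalize(query)
    return sorted(d for d, words in docs.items()
                  if any(normalize(w) == q for w in words))

# Invented toy collection: document id -> words.
docs = {1: ["connected"], 2: ["connecting"], 3: ["connects"], 4: ["network"]}

print(search("connect", docs))              # exact match only
print(search("connect", docs, crude_stem))  # match on stems
```

With stemming, the four distinct words in the toy collection collapse to two index terms ("connect" and "network"), which is the vocabulary reduction mentioned above.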

Stemming Approaches

• Table lookup
– Generation is complex
– Final tables are often incomplete
• Affix removal
– Suffix vs. prefix (e.g. mega-volt)
– Doesn’t always work, esp. not in German
• Successor variety stemming
– More complex than suffix removal
– Uses (e.g.) linguistic approaches and techniques from morphology
• N-grams
– General clustering approach which can also be used for stemming
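For the n-gram approach, one common choice (assumed here, since the slide does not name a measure) is Dice's coefficient over the sets of character bigrams of two words. Words whose similarity exceeds a chosen threshold can then be clustered as variants of one term.

```python
def bigrams(word):
    """Set of distinct adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

# 7 and 8 distinct bigrams, 6 shared: 2 * 6 / 15 = 0.8
print(dice("statistics", "statistical"))
```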

Step 4: Selection of index terms

• Full text representation vs. a selected set of terms as index terms
• Many distinct automatic approaches
• The identification of noun groups (Inquery system)
– Most of the semantics is carried by the nouns in a sentence
– Combine nearby nouns into noun groups
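A toy sketch of noun grouping, assuming the text has already been part-of-speech tagged. The `"NOUN"` tag name and the rule "merge only directly adjacent nouns" are simplifying assumptions for illustration; this is not the Inquery implementation.

```python
def noun_groups(tagged_tokens):
    """Merge runs of consecutive nouns into single index terms."""
    groups, current = [], []
    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
        else:
            if current:  # a non-noun closes the current run
                groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return groups

# Hand-tagged example sentence (tags are assumptions).
sentence = [("the", "DET"), ("computer", "NOUN"), ("science", "NOUN"),
            ("department", "NOUN"), ("offers", "VERB"),
            ("retrieval", "NOUN"), ("courses", "NOUN")]
print(noun_groups(sentence))
```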

Step 5: Construction of term categorization structures

• A thesaurus
– A standard vocabulary for indexing and searching
– Relationships among indexed terms
– Assists users with locating terms for proper query formulation
• An example of an entry in Roget’s thesaurus
– Cowardly (adjective)
– Ignobly lacking in courage: cowardly turncoats
– Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang)

Thesauri

• Indexed terms
– Denote a concept, the basic semantic unit
– Can be individual words, groups of words, or phrases
– Terms are basically nouns
– Terms can also be verbs in gerund form whenever they are used as nouns (teaching, acting, etc.)
• Relationships
– The set of terms related to an entry is mostly composed of synonyms or near-synonyms

The Use of Thesauri in IR

• Selecting related terms in a thesaurus to reformulate a query when the initial query words are erroneous or improper
• Unfortunately, this approach does not work well in general
– Relationships captured in a thesaurus may not be valid in the local context of a given query
• An alternative: determine thesaurus-like relationships at query time
– Challenging for web search: can’t afford the effort for each individual query
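Thesaurus-based reformulation can be sketched as a dictionary lookup. The `THESAURUS` dictionary holds a few synonyms from the Roget entry shown earlier; the function name `expand_query` and the `max_related` cutoff are assumptions for illustration. Note that the lookup ignores the context of the query, which is exactly the weakness described above.

```python
# Tiny thesaurus built from the Roget example entry (illustrative only).
THESAURUS = {
    "cowardly": ["chicken-hearted", "craven", "dastardly",
                 "faint-hearted", "gutless"],
}

def expand_query(terms, thesaurus=THESAURUS, max_related=3):
    """Append up to max_related thesaurus terms after each query term."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(thesaurus.get(term, [])[:max_related])
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(expanded))

print(expand_query(["cowardly", "turncoats"]))
```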

The Porter Algorithm

• Special algorithm for the English language based on suffix removal

• Five distinct phases, applied to each word one after another

• Example: remove plural ‘s’ and ‘sses’
– Rules: sses -> ss, s -> NIL (obey order!)
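Why the rule order matters can be seen in a short sketch that applies the first matching rule from an ordered list. The function name `strip_plural` is an assumption; the two rules are the ones given above.

```python
def strip_plural(word, rules=(("sses", "ss"), ("s", ""))):
    """Apply the first rule whose suffix matches; rules are tried in order."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(strip_plural("classes"))  # sses -> ss fires first
print(strip_plural("cats"))     # only s -> NIL matches

# Wrong order: s -> NIL fires on "classes" before sses -> ss is tried.
wrong_order = (("s", ""), ("sses", "ss"))
print(strip_plural("classes", wrong_order))
```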

Porter Algorithm

• Conventions
– C: consonant, V: vowel, L: consonant or vowel
– Combinations of C, V, L define patterns
– Operators “+” and “*” form complex patterns
  • *: zero or more repetitions of a given pattern: (V*C)
  • +: one or more repetitions of a given pattern: ((C)*((V)+(C)+)+(V)*)
• Statements/commands
– Rule-based statements
  • Single rule: If (*V*L) then ed -> NIL (remove ed)
  • Multiple rules:
    – Select the rule with the longest suffix: { sses -> ss; ies -> i; ss -> ss; s -> NIL }
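The conventions above can be sketched in a few helper functions. All function names are assumptions. Two interpretive choices are also assumptions: "y" is counted as a vowel when it follows a consonant (as in Porter's definition), and the condition on the "ed" rule is read as "the remaining stem contains a vowel". The input is assumed to be alphabetic.

```python
import re

def cv_pattern(word):
    """Map each letter to 'C' or 'V'; 'y' counts as a vowel after a consonant."""
    pattern = ""
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C"):
            pattern += "V"
        else:
            pattern += "C"
    return pattern

def matches_slide_pattern(word):
    """Test the C/V form against (C)*((V)+(C)+)+(V)*."""
    return re.fullmatch(r"C*(V+C+)+V*", cv_pattern(word)) is not None

RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_longest(word, rules=RULES):
    """Among the rules whose suffix matches, apply the one with the longest suffix."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def remove_ed(word):
    """Conditional rule: drop 'ed' only if the remaining stem contains a vowel."""
    if word.endswith("ed") and "V" in cv_pattern(word[:-2]):
        return word[:-2]
    return word

print(cv_pattern("policy"), matches_slide_pattern("trouble"))
print(apply_longest("ponies"), apply_longest("caress"))
print(remove_ed("played"), remove_ed("bled"))
```

The rule ss -> ss leaves the word unchanged; its purpose is to win over s -> NIL so that words like "caress" keep their final "ss".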

Try Porter Algorithm

• Played
• Classes
• Policy
• Position
• Capability
• Active, actively, activity

The Porter Algorithm: advantages & disadvantages

• Advantage: Easy algorithm with good results
– abate, abated, abatement, abatements, abates --> abat
• Disadvantage: Not always correct, e.g.
– Same root for police – policy, execute – executive, …
– Different root for european – europe, search – searcher

Next Lecture:

• Compression. Ch. 7