October 2009 HLT: Conflation Algorithms 1

Human Language Technology

Conflation Algorithms

Acknowledgements

• John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm

• Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willett, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]

• Jurafsky & Martin appendix B pp 833-836.

Conflation

COMPUTES, COMPUTE, COMPUTATION, COMPUTABILITY, COMPUTING, COMPUTER → COMPUT

Types of Conflation Algorithm

• Stemming
  – Process based, e.g. affix stripping

• Lemmatisation
  – Attempt to map to same lemma
  – POS dependent

• Morphological Analysis
  – Includes morpho-syntactic information

Word Conflation Algorithms

• Morphological analysis versus conflation

• Notion of word class used is application dependent
  – Genealogy: phonetic similarity
  – Information Retrieval: semantic similarity

• Based on written language (not phonetic transcription)

• Well known algorithms
  – Soundex
  – Porter

Soundex: Problems with Names

• Names can be misspelt: Rossner

• Same name can be spelt in different ways: Kirkop; Chircop

• Same name appears differently in different cultures: Tchaikovsky; Chaicowski

• To solve this problem, we need phonetically oriented algorithms which can find similar sounding terms and names.

• Just such a family of algorithms exists; they are called SoundExes, after the first patented version.

The Soundex Algorithm

• A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.

• It is very handy for searching large databases.

• Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.

Soundex Algorithm 1

The Soundex Algorithm uses the following steps to encode a word:

1. The first character of the word is retained as the first character of the Soundex code.

2. The following letters are discarded: a,e,i,o,u,h,w, and y.

3. Remaining consonants are given a code number.

4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").

Code Numbers

Letters                  Code

b, p, f, v               1
c, s, k, g, j, q, x, z   2
d, t                     3
l                        4
m, n                     5
r                        6

Soundex Algorithm: Example

The Soundex Algorithm uses the following steps to encode a word, here ROSNER:

1. The first character of the word is retained as the first character of the Soundex code. [R]

2. The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]

3. Remaining consonants are given a code number. [R256]

4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]

Soundex Algorithm 2

– The resulting code is modified so that it becomes exactly four characters long.

– If it is fewer than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").

– If it is more than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
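
The steps and code table above can be put together in a short Python sketch. It follows the simplified description on these slides literally (discard letters first, then collapse repeated code numbers), so it may disagree with other published Soundex variants on some names. The names `CODES` and `soundex` are illustrative, not taken from any library.

```python
# Letter groups and their code numbers, as given in the "Code Numbers" table.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word following the simplified steps on the slides (a sketch)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    digits = []
    for c in rest:
        d = CODES.get(c)
        if d is None:
            # Step 2: a, e, i, o, u, h, w and y carry no code and are discarded.
            continue
        if digits and digits[-1] == d:
            # Step 4: consecutive identical code numbers are coded once.
            continue
        digits.append(d)
    code = first.upper() + "".join(digits)
    # Pad with zeroes or truncate so the code is exactly four characters.
    return (code + "000")[:4]


print(soundex("Rosner"), soundex("Rossner"))
```

Under these assumptions the misspelling Rossner receives the same code as Rosner, while Kirkop and Chircop still differ because the retained first letters differ.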

Uses for the Soundex Code

• Airline reservations - The Soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.

• U.S. Census - As noted above, the U.S. Census Bureau was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the twentieth century.

• Genealogy - In genealogy, the Soundex code is most often used to avoid problems when dealing with names that might have alternate spellings.

Improvements

• Preprocessing before applying the basic algorithm, e.g. identification of
  – DG with G
  – GH with H
  – GN with N (not 'ng')
  – KN with N
  – PH with F

• Question: where to stop?

• Question: how to evaluate?
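
The preprocessing idea can be sketched as a list of digraph substitutions applied before the basic algorithm. This is only one possible reading of the slide: the order of the substitutions and the names `SUBSTITUTIONS` and `preprocess` are assumptions.

```python
# Digraph identifications listed on the slide, applied in the order given there.
SUBSTITUTIONS = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(name):
    """Rewrite common digraphs before Soundex encoding (illustrative sketch)."""
    text = name.lower()
    for old, new in SUBSTITUTIONS:
        # Plain substring replacement: "gn" is rewritten, "ng" is left alone.
        text = text.replace(old, new)
    return text


print(preprocess("Phillip"), preprocess("Knight"), preprocess("King"))
```

Each extra substitution changes which names fall together, which is why the two questions above (where to stop, how to evaluate) matter.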

IR Applications

• Information Retrieval:

Query → [IR system] → Relevant Documents

• “Bag of Terms” document model

• What is a single term?

Why Stemming is Necessary

• Frequently we get collections of words of the following kind in the same document:

compute, computer, computing, computation, computability, …

• Performance of an IR system will be improved if all of these terms are conflated.
  – Fewer terms to worry about
  – More accurate statistics

Issues

• Is a dictionary available?
  – Stems
  – Affixes

• Motivation: linguistic credibility or engineering performance?

• When to remove an affix versus when to leave it alone

• Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1" and "this document is about W2".

relate/relativity vs. radioactive/radioactivity (the first pair differ in meaning and should stay apart; the second pair can be conflated)

Consonants and Vowels

• A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So the y in sky is a vowel, while the y in toy is a consonant.

• If a letter is not a consonant it is a vowel.

• A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.

• For example, the word troubles maps to C V C V C.

• Any word or part of a word therefore has one of the following forms:

CVCV…C
CVCV…V
VCVC…C
VCVC…V

Measure

• All the above patterns can be replaced by the following regular expression

[C] (VC)^m [V]

where square brackets mark an optional element and (VC)^m means VC repeated m times.

• m is called the measure of any word or word part.

• m=0: tr, ee, tree, y, by
  m=1: trouble, oats, trees, ivy
  m=2: troubles, private
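
The consonant/vowel mapping and the measure can be computed with a few lines of Python. This is a minimal sketch assuming lower-case alphabetic input; the function names `is_consonant`, `cv_form` and `measure` are illustrative, and the treatment of y follows Porter (1980): y counts as a vowel only when the preceding letter is a consonant.

```python
from itertools import groupby


def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel (toy),
        # and a vowel after a consonant (sky, by).
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse runs of consonants and vowels, e.g. troubles -> CVCVC."""
    word = word.lower()
    pattern = ("C" if is_consonant(word, i) else "V" for i in range(len(word)))
    return "".join(key for key, _ in groupby(pattern))


def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs in the collapsed form."""
    return cv_form(word).count("VC")


for w in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, cv_form(w), measure(w))
```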

Rules

• Rules for removing a suffix are given in the form

(condition) S1 → S2

• i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2. For instance:

(m > 1) EMENT → ε (S2 is empty, so the suffix is simply deleted)

• Example: enlargement → enlarg
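
A rule of the form (condition) S1 → S2 can be sketched as a small function that takes the condition as a predicate on the stem. The helper names (`apply_rule`, `measure`, `is_consonant`) are hypothetical, and the input is assumed to be lower case.

```python
def is_consonant(word, i):
    """Porter's consonant test: y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    pattern = "".join("C" if is_consonant(stem, i) else "V"
                      for i in range(len(stem)))
    return pattern.count("VC")


def apply_rule(word, s1, s2, condition):
    """Apply (condition) S1 -> S2; return the word unchanged if it does not fire."""
    if not word.endswith(s1):
        return word
    stem = word[:len(word) - len(s1)]
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT -> empty string
print(apply_rule("enlargement", "ement", "", lambda stem: measure(stem) > 1))
print(apply_rule("cement", "ement", "", lambda stem: measure(stem) > 1))
```

The second call leaves cement alone because the stem c has measure 0, which is the point of attaching a condition to the rule.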

Conditions

• *S - stem ends with s

• *Z - stem ends with z

• *T - stem ends with t

• *v* - stem contains a vowel

• *d - stem ends with a double consonant

• *o - stem ends cvc, where second c is not w, x or y, e.g. -wil, -hop

• In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))

• Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
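
The conditions above translate naturally into predicates on the stem, which can then be combined with ordinary Boolean operators. The following is a sketch only; all function names are illustrative and stems are assumed to be lower case.

```python
def is_consonant(word, i):
    """Porter's consonant test: y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m in [C](VC)^m[V], counted as vowel-to-consonant transitions."""
    pattern = "".join("C" if is_consonant(stem, i) else "V"
                      for i in range(len(stem)))
    return pattern.count("VC")


def ends_with(stem, letter):            # *S, *Z, *T
    return stem.endswith(letter)


def contains_vowel(stem):               # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):        # *d
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):                     # *o
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


def m_gt1_and_s_or_t(stem):             # (m>1 and (*S or *T))
    return measure(stem) > 1 and (ends_with(stem, "s") or ends_with(stem, "t"))


print(ends_cvc("hop"), ends_cvc("snow"), m_gt1_and_s_or_t("adopt"))
```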

Organisation

Step 1: Plurals and Third Person Singular Verbs (-s)

Step 2: Verbal Past Tense and Progressive (-ed, -ing)

Step 3: Y to I, Noun Inflections (fly/flies)

Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)

Step 6: Derivational Morphology, Single Suffixes

Step 7: Cleanup

Step 1: Plural Nouns and 3rd Person Singular Verbs

Condition   Rewrite      Example

            SSES → SS    caresses → caress
            IES → I      ponies → poni
            SS → SS      caress → caress
            S → ε        cats → cat
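
Step 1 has no conditions, so it reduces to picking the rule with the longest matching suffix. A minimal sketch, with `STEP1_RULES` and `step1` as illustrative names and lower-case input assumed:

```python
# (suffix, replacement) pairs from the Step 1 table.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the Step 1 rule whose suffix is the longest match (sketch)."""
    for suffix, replacement in sorted(STEP1_RULES,
                                      key=lambda rule: len(rule[0]),
                                      reverse=True):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step1(w))
```

The SS → SS rule looks redundant, but because the longest match wins it shields words such as caress from the shorter S rule.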

Step 2a: Verbal Past Tense and Progressive Forms

Condition   Rewrite      Example

(m > 0)     EED → EE     feed → feed; agreed → agree
(*v*)       ED → ε       plastered → plaster; bled → bled
(*v*)       ING → ε      killing → kill; sing → sing

Step 2b: Cleanup

Applied if the 2nd or 3rd rule of the last step succeeds.

Condition                      Rewrite           Example

                               AT → ATE          generat → generate
                               BL → BLE          troubl → trouble
                               IZ → IZE          capsiz → capsize
*d and not (*L or *S or *Z)    → single letter   hopp → hop; hiss → hiss
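
Steps 2a and 2b can be sketched together, since the cleanup only runs when an -ed or -ing suffix has actually been removed. The sketch uses m > 0 for the EED rule, as in Porter (1980); that is the condition under which agreed → agree while feed is left alone. It covers only the rules listed on these two slides, and all names are illustrative.

```python
def is_consonant(word, i):
    """Porter's consonant test: y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V"
                      for i in range(len(stem)))
    return pattern.count("VC")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def cleanup(stem):
    """Step 2b: repair the stem left after removing -ed or -ing."""
    for ending in ("at", "bl", "iz"):
        if stem.endswith(ending):
            return stem + "e"
    double = (len(stem) >= 2 and stem[-1] == stem[-2]
              and is_consonant(stem, len(stem) - 1))
    if double and stem[-1] not in "lsz":
        return stem[:-1]          # hopp -> hop, but hiss and kill are kept
    return stem


def step2(word):
    """Step 2a followed, where applicable, by the Step 2b cleanup."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:len(word) - len(suffix)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word


for w in ["agreed", "feed", "plastered", "bled", "hopping", "generated"]:
    print(w, "->", step2(w))
```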

Step 3: Y to I

Condition   Rewrite   Example

(*v*)       Y → I     happy → happi; cry → cry

Step 4: Derivational Morphology I: Multiple Suffixes (excerpt)

Condition   Rewrite          Example

(m > 0)     ATIONAL → ATE    relational → relate
(m > 0)     TIONAL → TION    conditional → condition
(m > 0)     ENCI → ENCE      valenci → valence
(m > 0)     ABLI → ABLE      comfortabli → comfortable
(m > 0)     OUSLI → OUS      analogousli → analogous
(m > 0)     IZATION → IZE    digitization → digitize
(m > 0)     ATION → ATE      generation → generate
(m > 0)     ATOR → ATE       operator → operate
(m > 0)     ALISM → AL       formalism → formal
(m > 0)     IVENESS → IVE    pensiveness → pensive
(m > 0)     FULNESS → FUL    hopefulness → hopeful
(m > 0)     OUSNESS → OUS    callousness → callous
(m > 0)     ALITI → AL       formaliti → formal
(m > 0)     BILITI → BLE     possibiliti → possible
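
The Step 4 excerpt can be sketched as a lookup table plus longest-suffix matching. Inputs are assumed to be lower case and to have passed through the earlier steps, so a final y has already become i (formaliti, not formality). `STEP4_RULES` and `step4` are illustrative names.

```python
def is_consonant(word, i):
    """Porter's consonant test: y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V"
                      for i in range(len(stem)))
    return pattern.count("VC")


# Suffix -> replacement, taken from the excerpt in the table above.
STEP4_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "abli": "able",
    "ousli": "ous", "ization": "ize", "ation": "ate", "ator": "ate",
    "alism": "al", "iveness": "ive", "fulness": "ful", "ousness": "ous",
    "aliti": "al", "biliti": "ble",
}


def step4(word):
    """Pick the longest matching suffix, then rewrite it if m > 0 (sketch)."""
    matches = [suffix for suffix in STEP4_RULES if word.endswith(suffix)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    stem = word[:len(word) - len(suffix)]
    # If the condition fails, no shorter rule is tried in this step.
    return stem + STEP4_RULES[suffix] if measure(stem) > 0 else word


for w in ["relational", "rational", "digitization", "possibiliti"]:
    print(w, "->", step4(w))
```

Here rational is left unchanged: ATIONAL is the longest match, its stem r has measure 0, and the shorter TIONAL rule is not tried as a fallback.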

Step 6: Derivational Morphology III: Single Suffixes

Condition                 Rewrite      Example

(m > 1)                   AL → ε       revival → reviv
(m > 1)                   ANCE → ε     allowance → allow
(m > 1)                   ENCE → ε     inference → infer
(m > 1)                   ER → ε       airliner → airlin
(m > 1)                   IC → ε       gyroscopic → gyroscop
(m > 1)                   ABLE → ε     adjustable → adjust
(m > 1)                   ANT → ε      irritant → irrit
(m > 1)                   EMENT → ε    replacement → replac
(m > 1)                   MENT → ε     adjustment → adjust
(m > 1)                   ENT → ε      dependent → depend
(m > 1 and (*S or *T))    ION → ε      adoption → adopt
(m > 1)                   OUS → ε      homologous → homolog
(m > 1)                   ISM → ε      formalism → formal
(m > 1)                   ATE → ε      activate → activ
(m > 1)                   ITI → ε      angulariti → angular

Porter Example

• INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management

Porter Output

Original Word Stemmed Word

first first

focus focu

area area

integrated integr

projects project

help help

develop develop

principally princip

common common

open open

platforms platform

software softwar

services servic

supporting support

distributed distribut

information inform

decision decis

systems system

risk risk

crisis crisi

management manag

Stemming Errors

• Under-stemming
  – the error of taking off too small a suffix
  – example: croulons → croulon
  – since croulons is a form of the verb crouler

• Over-stemming
  – the error of taking off too much
  – example: croûtons → croût
  – since croûtons is the plural of croûton

• Mis-stemming
  – taking off what looks like an ending, but is really part of the stem
  – example: reply → rep

Summary

• Conflation serves different purposes

• Generally, motivation is to achieve an engineering goal rather than linguistic fidelity.

• This can cause errors in the bag of words model.

• Soundex and Porter are very well established and easily available.