
©2002 Paula Matuszek

iMiner Introduction

iMiner from IBM

Text mining tool with multiple components. Text analysis tools include:

– Language Identification Tool

– Feature Extraction Tool

– Summarizer Tool

– Topic Categorization Tool

– Clustering Tools

– http://www-4.ibm.com/software/data/iminer/fortext/index.html
– http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23engl/im4t23engl1.htm

iMiner for Text 2

Basic technology includes:

– authority file with terms

– heuristics for extracting additional terms

– heuristics for extracting other features

– Dictionaries with parts of speech

– Partial parsing for part-of-speech tagging

– Significance measure for terms: Information Quotient (IQ).

Knowledge base cannot be directly expanded by end user

Strong machine-learning component

Language Identification

Can analyze:

– an entire document

– a text string input from the command line

Currently handles about a dozen languages

Can be trained; ML tool takes input in the language to be learned

Determines approximate language proportions in bilingual documents

Language Identification

Basically treated as a categorization problem, where each language is a category

Training documents are processed to extract terms.

Importance of terms for categorization is determined statistically

Dictionaries of weighted terms are used to determine the language of new documents
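The categorization scheme above can be sketched in a few lines of Python. This is illustrative, not iMiner's actual implementation: the tiny corpus, the function names, and the relative-frequency weighting are all assumptions.

```python
from collections import Counter

def train(corpus):
    """Build a dictionary of weighted terms per language.
    corpus: {language: [training strings]}; weight = relative frequency."""
    models = {}
    for lang, docs in corpus.items():
        counts = Counter(w for d in docs for w in d.lower().split())
        total = sum(counts.values())
        models[lang] = {w: c / total for w, c in counts.items()}
    return models

def identify(models, text):
    """Categorize: score each language by the summed weights of its terms."""
    words = text.lower().split()
    scores = {lang: sum(weights.get(w, 0.0) for w in words)
              for lang, weights in models.items()}
    return max(scores, key=scores.get)

corpus = {
    "english": ["the cat sat on the mat", "a dog and the cat"],
    "german":  ["die katze sitzt auf der matte", "der hund und die katze"],
}
models = train(corpus)
print(identify(models, "the dog and the mat"))    # english
print(identify(models, "der hund auf der matte")) # german
```

A real system would also need smoothing for unseen terms and could report per-language score proportions for bilingual documents.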

Feature Extraction

Locate and categorize relevant features in text

Some features are themselves of interest

Also a starting point for other tools like classifiers, categorizers

Features may or may not be “meaningful” to a person

Goal is to find aspects of a document which somehow characterize it

Name Extraction

Extracting Proper Names

– People, places, organizations

– Valuable clues to subject of text

Dictionaries of canonical forms

Additional names extracted from documents

– Parsing finds tokens

– Additional parsing groups tokens into noun phrases

– Rules identify tokens which are names

– Variant groups are assigned a canonical name which is the most explicit variant found in document


Examples for Name Extraction

“This subject is taught by Paula Matuszek.”

– Recognize Paula as a first name of a person

– Recognize Matuszek as a capitalized word following a first name.

– Therefore “Paula Matuszek” is probably the name of a person.

“This subject is taught by Villanova University.”

– Recognize Villanova as a probable name based on capitalization.

– Recognize University as a term which normally names an institution.

– Therefore “Villanova University” is probably the name of an institution.

“This subject is taught by Howard University.”

– BOTH of these sets of rules could apply, so rules need to be prioritized to determine the more likely parse.
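The prioritization idea can be sketched as follows. The dictionaries and the two rules are toy assumptions for illustration; iMiner's actual rule set is far richer.

```python
FIRST_NAMES = {"Paula", "Howard"}          # toy first-name dictionary
INSTITUTION_TERMS = {"University", "College", "Institute"}

def classify(phrase):
    """Apply prioritized rules to a capitalized two-word phrase.
    The institution rule is checked first (higher salience), so
    'Howard University' resolves as an institution even though
    'Howard' is also a first name."""
    words = phrase.split()
    # Rule 1 (higher salience): phrase ends in an institution term.
    if words[-1] in INSTITUTION_TERMS:
        return "institution"
    # Rule 2: known first name followed by a capitalized word.
    if words[0] in FIRST_NAMES and words[1][0].isupper():
        return "person"
    return "unknown"

print(classify("Paula Matuszek"))        # person
print(classify("Villanova University"))  # institution
print(classify("Howard University"))     # both rules match; priority decides
```

Ordering the checks is the simplest way to encode rule salience; a larger system would attach an explicit priority number to each rule.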

Other Rule Examples

Dr., Mr., Ms. are titles, and titles followed by capitalized words frequently indicate names. If followed by only one word, it’s the last name.

Capitalized word followed by single capitalized letter followed by capitalized word is probably FN MI LN.

Nouns can be names. Verbs can’t.


Abbreviation/Acronym Extraction

Fruitful source of variants for names and terms

Existing dictionary of common terms

Name followed by “(” [A-Z]+ “)” probably gives an abbreviation.

Conventions regarding word-internal case and prefixes: “MSDOS” matches “MicroSoft DOS”, “GB” matches “gigabyte”.

Number Extraction

Useful primarily to improve performance of other extractors.

Variant expressions of numbers:

– One thousand three hundred and twenty seven
– thirteen twenty seven
– 1327

Other numeric expressions:

– twenty-seven percent
– 27%

Base forms are easy; most of effort is variants and determining canonical form based on rules
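A minimal sketch of converting spelled-out numbers to a canonical digit form (it handles only the unit/scale patterns shown; the word tables are a small illustrative subset, not a complete grammar):

```python
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12, "thirteen": 13, "twenty": 20,
         "thirty": 30, "forty": 40, "fifty": 50}
SCALES = {"hundred": 100, "thousand": 1000, "million": 1000000}

def words_to_number(phrase):
    """Convert a spelled-out number to its canonical digit form."""
    total, current = 0, 0
    for token in phrase.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in SCALES:
            scale = SCALES[token]
            current = max(current, 1) * scale
            if scale >= 1000:          # finished a scale group
                total += current
                current = 0
    return total + current

print(words_to_number("one thousand three hundred and twenty seven"))  # 1327
print(words_to_number("twenty-seven"))  # 27
```

Variants such as “thirteen twenty seven” need extra context rules (year vs. quantity), which is exactly where most of the effort goes.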

Date Extraction

Absolute and relative dates

Produces canonical form:

– March 27, 1997 → 1997/03/27
– tomorrow → ref+0000/00/01
– a year ago → ref-0001/00/00

Similar techniques and issues as for numbers
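The canonical forms on the slide can be produced with a small lookup-plus-regex sketch (the `RELATIVE` table and the single absolute-date pattern are illustrative assumptions; a real extractor needs many more patterns):

```python
import re

MONTHS = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5,
          "June": 6, "July": 7, "August": 8, "September": 9,
          "October": 10, "November": 11, "December": 12}

RELATIVE = {"tomorrow": "ref+0000/00/01",
            "yesterday": "ref-0000/00/01",
            "a year ago": "ref-0001/00/00"}

def canonical_date(expr):
    """Return the canonical form of an absolute or relative date."""
    if expr.lower() in RELATIVE:
        return RELATIVE[expr.lower()]
    m = re.match(r"(\w+) (\d{1,2}), (\d{4})", expr)   # "March 27, 1997"
    if m and m.group(1) in MONTHS:
        return f"{m.group(3)}/{MONTHS[m.group(1)]:02d}/{int(m.group(2)):02d}"
    return None

print(canonical_date("March 27, 1997"))  # 1997/03/27
print(canonical_date("tomorrow"))        # ref+0000/00/01
print(canonical_date("a year ago"))      # ref-0001/00/00
```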

Money Extraction

Recognizes currencies and produces canonical representation

Uses number extractor

Examples:

– “twenty-seven dollars” → “27.000 dollars USA”
– “DM 27” → “27.000 marks Germany”

Term Extraction

Identify other important terms found in text

Other major lexical clue for subject, especially if repeated.

May use output from other extractors in rules

Recognizes common lexical variants and reduces them to canonical form -- stemming

Machine learning is much more important here

Term Extraction

Dictionary with parts-of-speech info for English

Pattern matching to find noun phrase structure typical of technical terms.

Feature repositories:

– Authority dictionary: canonical forms, variants, correct feature map. Used BEFORE heuristics

– Residue dictionary: complex feature type (name, term, pattern). Used AFTER heuristics

Authority and residue dictionaries trained

Information Quotient

Each extracted feature (word, phrase, name) is assigned an information quotient

Represents the significance of the feature in the document

– TF-IDF: Term Frequency–Inverse Document Frequency
– Position information
– Stop words
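The TF-IDF component of such a significance score can be sketched directly (position information and stop-word filtering, which the slide also mentions, are omitted; the corpus is an illustrative assumption):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document by TF * IDF.
    docs: list of token lists; returns one {term: score} dict per doc."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [["gene", "protein", "gene"],
        ["protein", "cell"],
        ["car", "engine"]]
s = tf_idf(docs)
# "gene" occurs twice in doc 0 and in no other document, so it scores
# higher there than "protein", which also appears in doc 1.
print(round(s[0]["gene"], 3))
```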


Feature Extraction Demo

Tool may be used for highlighting, etc., on documents to be displayed

Features extracted also form basis for other tools

Note that this is not full information extraction, although it is a starting point

http://www-4.ibm.com/software/data/iminer/fortext/extract/extractDemo.html

Other Features

Feature Extractor also identifies other features used by other text analysis tools:

– sentence boundaries
– paragraph boundaries
– document tags
– document structure
– collection statistics

Summarizer Tools

Collection of sentences extracted from document

Characteristic of document content

Works best for well-structured documents

Can specify length

Must apply feature extraction first

Summarizer

Feature extractor run first

Words are ranked

Sentences are ranked

Highest-ranked sentences are chosen

Configurable for length of sentence, for word salience

Works best when document is part of a collection

Word Ranking

Words scored if they:

– Appear in structures such as titles and captions

– Occur more often in document than in collection (word salience)

– Occur more than once in a document

Score is:

– salience if > threshold: tf*idf (by default)

– weighting factor if the word occurs in a title, heading, or caption

Sentence Ranking

Scored according to relevance in document and position in document.

Sum of:

– Scores of individual words
– Proximity of sentence to beginning of its paragraph
– “Bonus” for final sentence in a long paragraph and final paragraph in long documents
– Proximity of paragraph to beginning of document

All configurable
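The word-rank-then-sentence-rank pipeline can be sketched as a toy extractive summarizer. Everything here is an illustrative simplification: word salience is approximated by raw repetition, and the position bonus stands in for the paragraph-proximity scores above.

```python
from collections import Counter

def summarize(sentences, n=1, lead_bonus=0.5):
    """Extractive summary: rank sentences by summed word salience
    (words occurring more than once) plus a small bonus for being
    near the start of the document; return the top n in order."""
    token_lists = [[w.strip(".,!?").lower() for w in s.split()]
                   for s in sentences]
    freq = Counter(w for ws in token_lists for w in ws)
    scores = []
    for i, ws in enumerate(token_lists):
        word_score = sum(freq[w] for w in ws if freq[w] > 1)
        position_bonus = lead_bonus * (len(sentences) - i) / len(sentences)
        scores.append((word_score + position_bonus, i))
    # Take the n highest-scoring sentences, then restore document order.
    top = sorted(sorted(scores, reverse=True)[:n], key=lambda t: t[1])
    return [sentences[i] for _, i in top]

doc = ["Text mining extracts information from text.",
       "The weather was nice.",
       "Mining tools analyze text collections."]
print(summarize(doc, n=1))  # the first sentence scores highest
```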


Summarization Examples

Examples from IBM documentation

http://www-4.ibm.com/software/data/iminer/fortext/summarize/summarizeDemo.html

Some Common Statistical Measures (a brief digression)

TF x IDF

Pairwise and multiple-word phrase counts

Some other common statistical measures:

– information gain: how many bits of information we gain about a document’s category by knowing that a term is present

– mutual information: how strongly a term’s occurrence and a category’s occurrence depend on each other

– term strength: likelihood that a term will occur in both of two closely-related documents
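Information gain, for example, can be computed from category entropies. The four labeled toy documents are an illustrative assumption:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, term):
    """Bits of information gained about a document's category by
    knowing whether `term` is present. docs: list of (tokens, label)."""
    labels = [lab for _, lab in docs]
    classes = set(labels)
    def dist(subset):
        return [subset.count(c) / len(subset) for c in classes] if subset else []
    with_term = [lab for toks, lab in docs if term in toks]
    without = [lab for toks, lab in docs if term not in toks]
    h_prior = entropy(dist(labels))
    h_cond = (len(with_term) / len(docs)) * entropy(dist(with_term)) \
           + (len(without) / len(docs)) * entropy(dist(without))
    return h_prior - h_cond

docs = [(["gene", "cell"], "bio"), (["protein"], "bio"),
        (["engine", "car"], "auto"), (["car"], "auto")]
print(information_gain(docs, "car"))   # 1.0 -- perfectly predicts "auto"
print(information_gain(docs, "gene"))  # lower: only partial information
```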

Topic Categorization Tool

Assign documents to predetermined categories

Must first be trained:

– Training tool creates category scheme

– Dictionary that stores significant vocabulary statistics

Output is list of possible categories and probabilities for each document

Can filter initial schema for faster processing

Features Used for Categorizing

Linguistic Features

– Uses the features extracted by Feature Extraction tool

N-Grams

– Letter groupings and short words

– Can be used for non-English, because it doesn’t depend on heuristics

– Used by Language categorizer

Document Categorizing

Individual document is analyzed for features

Features are compared to those determined for categories:

– terms present/absent
– IQ of terms
– frequencies
– document structure

Document Categorization

Important issue is determining which features! High dimensionality is expensive.

Ideally you want a small set of features which is:

– present in all documents of one category
– absent in all other documents

In actuality, not that clean. So:

– use features with relatively high separation
– eliminate a feature which correlates very highly with another feature (to reduce dimension space)

Categorization Demo

Typically categorization is a component in a system which then “does something” with the categorized documents

Ambiguous documents (not assigned to any one category with high probability) often indicate a new category evolving.

http://www-4.ibm.com/software/data/iminer/fortext/categorize/categorize.

Clustering Tools

Organize documents without pre-existing categories

Hierarchical clustering:

– creates a tree where each leaf is a document, each cluster is positioned under the most similar cluster one step up

Binary Relational clustering:

– creates a flat set of clusters with each document assigned to its best fit and relations between clusters captured

Hierarchical Clustering

Input is a set of documents

Output is a dendrogram:

– Root

– Intermediate levels

– Leaves: link to actual documents

Slicing is used to create manageable HTML tree

Steps in Hierarchical Clustering

Select linguistic preprocessing technique: determines “similarity”

Cluster documents: create dendrogram based on similarity

Define shape of tree with slicing technique and produce HTML output

Linguistic Preprocessing

Determining similarity between documents and clusters: how do we define “similar”?

– Lexical affinity. Does not require any preprocessing.

– Linguistic features. Requires that the feature extractor be run first.

iMiner is either/or; you cannot combine the two methods of determining similarity

Clustering: Lexical Affinities

Lexical affinities: groups of words which appear frequently close together

– created “on the fly” during a clustering task
– word pairs
– stemming and other morphological analysis
– stop words

Results in documents with textual similarity being clustered together

Clustering: Linguistic Features

Linguistic features: use features extracted by the feature extraction tool

– Names of organizations
– Domain technical terms
– Names of individuals

Can allow focusing on specific areas of interest

Best if you have some idea what you are interested in

Hierarchical Clustering Steps

Put each document in a cluster, characterized by its lexical or linguistic features

Merge the two most similar clusters

Continue till all clusters are merged
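The steps above are classic bottom-up agglomerative clustering. A minimal sketch, using Jaccard similarity over feature sets (the similarity measure, documents, and nested-tuple tree representation are illustrative assumptions):

```python
def jaccard(a, b):
    """Similarity between two clusters' feature sets."""
    return len(a & b) / len(a | b)

def agglomerate(docs):
    """Bottom-up clustering: start with one cluster per document,
    repeatedly merge the two most similar, and record each merge
    as a nested tuple (the dendrogram)."""
    clusters = [(name, set(feats)) for name, feats in docs.items()]
    while len(clusters) > 1:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: jaccard(clusters[ab[0]][1], clusters[ab[1]][1]))
        (na, fa), (nb, fb) = clusters[i], clusters[j]
        merged = ((na, nb), fa | fb)   # merged cluster holds both feature sets
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
    return clusters[0][0]

docs = {"d1": ["gene", "protein"], "d2": ["gene", "cell"],
        "d3": ["car", "engine"]}
print(agglomerate(docs))  # d1 and d2 share "gene", so they merge first
```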

Hierarchical Clustering: Slicing

The dendrogram is too big to be useful

Slicing reduces the size of the tree by merging clusters if they are “similar enough”:

– top threshold: collapse any tree which exceeds it

– bottom threshold: group under root any cluster which is lower

– Remaining clusters make a new tree

– # of steps sets depth of tree

Typical Slicing Parameters

Bottom:

– start around 5% or 10% similar
– 90% would mean only virtually identical documents get grouped

Top:

– good default is 90%
– if you want really identical, set to 100%

Depth:

– Typically 2 to 10
– Two would give you duplicates and the rest

Binary Relational Clustering

– Creates a flat set of clusters

– Each document assigned to its best fit

– Relations between clusters captured

Similarity based on features extracted by Feature Extraction tool

Relational Clustering: Document Similarity

Based on comparison of descriptors:

– Frequent descriptors across collection given more weight: priority to wide topics

– Rare descriptors given more weight: large number of very focused clusters

– Both, with rare descriptors given slightly higher weight: relatively focused topics but fewer clusters

Descriptors are binary: present or absent

Relational Clustering

Descriptors are features extracted by feature extraction tool.

Similarity threshold: at 100% only identical documents are clustered

Max # of clusters: overrides similarity threshold to get the number of clusters specified

Binary Relational Clustering Outputs

Outputs are:

– clusters: topics found, importance of topics, degree of similarity in cluster

– links: sets of common descriptors between clusters

Clustering Demo

Patents from “class 395”: information processing system organization

10% for top, 1% for bottom, total of 5 slices

Lexical affinity

http://www-4.ibm.com/software/data/iminer/fortext/cluster/clusterDemo.html

Summary

iMiner has a rich set of text mining tools

Product is well-developed, stable

No explicit user-modifiable knowledge base -- uses automated techniques and built-in KB to extract relevant information

Can be deployed to new domains without a lot of additional work

BUT not as effective in many domains as a tool with a good KB

No real information extraction capability

Information Extraction Overview

Given a body of text: extract from it some well-defined set of information

MUC conferences

Typically draws heavily on NLP

Three main components:

– Domain knowledge base
– Extraction engine
– Knowledge model

Information Extraction Domain Knowledge Base

Terms: enumerated list of strings which are all members of some class.

– “January”, “February”
– “Smith”, “Wong”, “Martinez”, “Matuszek”
– “lysine”, “alanine”, “cysteine”

Classes: general categories of terms

– Month names, last names, amino acids
– Capitalized nouns
– Verb phrases

Domain Knowledge Base

Rules: LHS, RHS, salience

Left-Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes

Right Hand Side (RHS): an action to be taken when the pattern is found

Salience: priority of this rule (weight, strength, confidence)

Some Rule Examples

<Monthname> <Year> => <Date>

<Date> <Name> => print “Birthdate”, <Name>, <Date>

<Name> <Address> => create address database record

<daynumber> “/” <monthnumber> “/” <year> => create date database record (50)

<monthnumber> “/” <daynumber> “/” <year> => create date database record (60)

<capitalized noun> <single letter> “.” <capitalized noun> => <Name>

<noun phrase> <to be verb> <noun phrase> => create “relationship” database record
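A rule engine of this LHS/RHS/salience kind can be sketched with regexes as patterns. The specific patterns, action labels, and salience values below are illustrative, loosely following the competing date rules above: when two rules match the same span, the higher-salience one fires.

```python
import re

# Each rule: (salience, LHS pattern, RHS action label).
RULES = [
    (60, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"), "date:mm/dd/yyyy"),
    (50, re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"), "date:dd/mm/yyyy"),
    (40, re.compile(r"\b(January|February|March) (\d{4})\b"), "date:month-year"),
]

def apply_rules(text):
    """Fire the highest-salience rule matching each span of text."""
    results = {}
    for salience, pattern, action in sorted(RULES, key=lambda r: r[0],
                                            reverse=True):
        for m in pattern.finditer(text):
            # setdefault keeps the first (highest-salience) match per span.
            results.setdefault(m.span(), (salience, action, m.group(0)))
    return list(results.values())

print(apply_rules("Born 3/27/1997, graduated March 2019."))
```

In a full system the RHS would run an action (fill a database record) rather than return a label, and the LHS would match over class-tagged tokens rather than raw text.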

Generic KB

Generic KB: KB likely to be useful in many domains

– names
– dates
– places
– organizations

Almost all systems have one

Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance.

Domain-specific KB

We mostly can’t afford to build a KB for the entire world.

However, most applications are fairly domain-specific.

Therefore we build domain-specific KBs which identify the kind of information we are interested in:

– Protein-protein interactions
– Airline flights
– Terrorist activities

Domain-specific KBs

Typically start with the generic KBs

Add terminology

Figure out what kinds of information you want to extract

Add rules to identify it

Test against documents which have been human-scored to determine precision and recall for individual items.

Knowledge Model

We aren’t looking for documents, we are looking for information. What information?

Typically we have a knowledge model or schema which identifies the information components we want and their relationships

Typically looks very much like a DB schema or object definition

Knowledge Model Examples

Personal records

– Name

» First name
» Middle initial
» Last name

– Birthdate

» Month
» Day
» Year

– Address

Knowledge Model Examples

Protein Inhibitors

– Protein name (class?)
– Compound name (class?)
– Pointer to source
– Cache of text
– Offset into text

Knowledge Model Examples

Airline Flight Record

– Airline

– Flight Number

– Origin

– Destination

– Date

» Status
» Departure time
» Arrival time

Summary

Text mining below the document level

NOT typically interactive, because it’s slow (1 to 100 meg of text/hr)

Typically builds up a DB of information which can then be queried

Uses a combination of term- and rule-driven analysis and NLP parsing

AeroText: very good system developed by LMCO; we will get a complete demo on March 26.