Business Intelligence & Data Mining-11
TRANSCRIPT
-
8/10/2019 Business Intelligence & Data Mining-11
1/37
Text Mining
Motivation for Text Mining
- Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation)
- Information-intensive business processes demand that we transcend from data mining to knowledge discovery in text.

[Chart: 10% structured (numerical or coded) information vs. 90% unstructured or semi-structured information]
Text Databases
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases
- Properties
  - Unstructured in general (some semi-structured, e.g. XML)
  - Semantics, not only syntax, is important
  - Non-numeric in nature
Challenges of Text Mining
- Very high number of possible dimensions
  - All possible word and phrase types in the language!
- Unlike data mining:
  - records (= docs) are not structurally identical
  - records are not statistically independent
- Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
- Ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
Technological Advances in Text Mining
- Advances in text processing technology
  - Natural Language Processing (NLP)
  - New algorithms
- Cheap hardware!
  - CPU
  - Disk
  - Network
Data Mining vs. Text Mining & Search vs. Discover

                          Structured Data    Unstructured Data (Text)
Search (goal-oriented)    Data Retrieval     Document Retrieval
Discover                  Data Mining        Text Mining
Text Databases and Information Retrieval
- Information / document retrieval
  - Traditional study of how to retrieve information from text documents
  - Information is organized into (a large number of) documents
- Information retrieval problem: locating relevant documents based on user input, such as keywords, queries or example documents
Text Databases and Information Retrieval
- Typical IR systems
  - Online library catalogs
  - Online document management systems
- Information retrieval vs. database systems
  - Some DB problems are not present in IR, e.g., update, transaction management, complex objects
  - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
What is Text Mining?
Text Mining is the process of synthesizing information by analyzing the relations, patterns, and rules within textual data: semi-structured or unstructured text.

The Tool Box
- Data mining algorithms
- Machine learning techniques
- Document / information retrieval concepts
- Statistical techniques
- Natural-language processing
Potential Applications
- Customer comment analysis
- Trend analysis
- Information filtering and routing
- Event tracking
- News story classification
- Web search
- Sentiment analysis
Basic Measures: Precision & Recall
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
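These definitions translate directly into code. A minimal sketch, treating Relevant and Retrieved as sets of document IDs (the IDs below are made up for illustration):

```python
def precision_recall(relevant, retrieved):
    # |{Relevant} ∩ {Retrieved}| is the set of correct responses
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs were retrieved
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(p, r)
```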
Precision decreases as Recall increases

[Figure: precision-recall curves comparing SVM, Decision Tree, SOM, and Logistic Regression classifiers; precision on one axis, recall on the other. Source: Dumais (1998)]
F-measure
- F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier)

F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)

- Useful as a simple single summary measure of performance
- Sometimes the breakeven F1 is used, i.e., F1 measured when P = R
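A quick sketch of the F1 computation from the formula above (P and R are assumed to come from a precision/recall evaluation):

```python
def f1(p, r):
    # harmonic mean of precision and recall; 0.0 when both are zero
    return 2 * p * r / (p + r) if p + r else 0.0

print(f1(0.5, 0.5))  # breakeven case: F1 equals P = R, so 0.5
```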
Challenges in text mining
- Semantics:
  - Synonymy: the keyword T does not appear anywhere in the document d, even though the document d is closely related to T (i.e., synonyms of T have been used in d)
  - Polysemy: the same keyword may mean different things in different contexts, e.g., green (colour) vs. green initiatives
Challenges in text mining
- Data pre-processing
  - Stop list: set of words that are deemed irrelevant, even though they may appear frequently
    - E.g., a, the, of, for, with, etc.
    - Stop lists may vary when the document set varies
  - Word stem: several words are small syntactic variants of each other since they share a common word stem
    - E.g., drug, drugs, drugged
  - Phrases: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    - E.g., data mining, decision support
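A toy pre-processing sketch covering the first two steps: tokenize, drop stop words, and collapse simple plural variants. The stop list and the trailing-"s" rule here are illustrative placeholders, not a production stop list or a real stemmer:

```python
# Illustrative stop list only; real stop lists are much longer and task-specific
STOP_WORDS = {"a", "the", "of", "for", "with"}

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # naive stemming: strip a trailing "s" (real systems use Porter or WordNet)
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The effects of drugs for the heart"))
# -> ['effect', 'drug', 'heart']
```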
Feature Extraction
Feature Extraction
Task: extract a good subset of words / phrases to represent documents

[Diagram: document collection -> all unique words/phrases -> feature extraction -> all good words/phrases]
Stemming
- Want to reduce all morphological variants of a word to a single index term
  - e.g. a document containing words like "fish" and "fisher" may not be retrieved by a query containing "fishing" (the word "fishing" is not explicitly contained in the document)
- Stemming: reduce words to their root form
  - e.g. "fish" becomes a new index term
- Porter stemming algorithm (1980)
  - relies on a preconstructed suffix list with associated rules
    - e.g. if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION => BINARIZE
  - Not always desirable: e.g., {university, universal} -> univers (in Porter's)
- WordNet: dictionary-based approach
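A minimal sketch of a Porter-style suffix rule. This implements only the single IZATION -> IZE rule quoted above, not the full Porter algorithm:

```python
def has_vowel_consonant(prefix):
    # the rule's condition: prefix contains a vowel followed by a consonant
    vowels = "aeiou"
    return any(prefix[i] in vowels and prefix[i + 1] not in vowels
               for i in range(len(prefix) - 1))

def stem_ization(word):
    w = word.lower()
    if w.endswith("ization") and has_vowel_consonant(w[:-7]):
        return w[:-7] + "ize"
    return w

print(stem_ization("BINARIZATION"))  # -> binarize
```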
Document Representation
Data Representation
- Document vector / frequency matrix / bag of words (BOW)
  - Each document is represented by a vector
  - Each dimension of the vector is associated with a word/term
  - For each document, the value of each dimension is the frequency with which that word occurs in the document.
Term Frequency
tf - term frequency weighting

w_ij = Freq_ij = the number of times the jth term occurs in document D_i.

Drawback: does not reflect the importance of the term for document discrimination.

Ex.
D1: ABRTSAQWAXAO
D2: RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   3  1  0  1  1  1  1  1  1  1
D2   3  2  1  0  1  1  1  1  0  1
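The same construction in code: a minimal bag-of-words sketch, using made-up word documents rather than the letter strings above:

```python
from collections import Counter

# Two illustrative documents (not from the slides)
docs = {"D1": "data mining finds patterns in data",
        "D2": "text mining mines text"}

# vocabulary = all unique terms across the collection, in sorted order
vocab = sorted({t for d in docs.values() for t in d.split()})

# per-document term-frequency counts, then a row per document over vocab
tf = {name: Counter(text.split()) for name, text in docs.items()}
matrix = {name: [tf[name][term] for term in vocab] for name in docs}

print(vocab)
print(matrix)
```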
Example of a document-term matrix
database SQL index regression likelihood linear
d1 24 21 9 0 0 3
d2 32 10 5 0 3 0
d3 12 16 5 0 0 0
d4 6 7 2 0 0 0
d5 43 31 20 0 3 0
d6 2 0 0 18 7 16
d7 0 0 1 32 12 0
d8 3 0 0 22 4 2
d9 1 0 0 34 27 25
d10 6 0 0 17 4 23
TF-IDF
tf-idf - term frequency * inverse document frequency

w_ij = Freq_ij * log(N / DocFreq_j)

N = the number of documents in the training document collection.
DocFreq_j = the number of documents in the training collection where the jth term occurs.

Advantage: reflects the importance of a term for document discrimination.
Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.

Ex.
     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
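A sketch of this weighting, using the D1/D2 term-frequency counts from the previous slide and log base 10 (which reproduces the 0.3 values above, since log10(2) ≈ 0.3):

```python
import math

# term-frequency counts taken from the slide's D1/D2 matrix (zeros omitted)
freq = {"D1": {"A": 3, "B": 1, "O": 1, "Q": 1, "R": 1, "S": 1, "T": 1, "W": 1, "X": 1},
        "D2": {"A": 3, "B": 2, "K": 1, "Q": 1, "R": 1, "S": 1, "T": 1, "X": 1}}
N = len(freq)

def doc_freq(term):
    # number of documents in which the term occurs
    return sum(1 for d in freq.values() if term in d)

def tfidf(doc, term):
    # w_ij = Freq_ij * log10(N / DocFreq_j)
    f = freq[doc].get(term, 0)
    return f * math.log10(N / doc_freq(term)) if f else 0.0

print(round(tfidf("D1", "O"), 1))  # term only in D1: 1 * log10(2) -> 0.3
print(round(tfidf("D1", "A"), 1))  # term in both docs: log10(1) = 0 -> 0.0
```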
Term Entropy
w_ij = log(1.0 + Freq_ij) * (1 + entropy(w_j))

where

entropy(w_j) = (1 / log N) * sum_{i=1..N} (Freq_ij / DocFreq_j) * log(Freq_ij / DocFreq_j)

is the average entropy of the jth term over the N documents; it evaluates to:
-1: if the word occurs once in every document
0: if the word occurs once in only one document
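The two boundary cases quoted above can be checked directly. A small sketch, where freqs holds the term's per-document frequencies across a collection of N documents:

```python
import math

def term_entropy(freqs):
    # entropy(w_j) = (1/log N) * sum_i (f/df) * log(f/df), summing over the
    # documents where the term occurs; df is the term's document frequency
    N = len(freqs)
    df = sum(1 for f in freqs if f > 0)
    s = sum((f / df) * math.log(f / df) for f in freqs if f > 0)
    return s / math.log(N)

print(term_entropy([1, 1, 1, 1]))  # once in every document -> -1 (up to rounding)
print(term_entropy([1, 0, 0, 0]))  # once in a single document -> 0.0
```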
Similarity Measures
Similarity measures
- For various tasks, we need a measure of similarity between documents
- Euclidean distance is a common measure
- Cosine similarity or dot-product is another measure used in text mining:
  - Focuses on co-occurrence of words
  - This corresponds to the cosine of the angle between the two vectors

sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
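The formula above as code, for two term-frequency vectors of equal length:

```python
import math

def cosine_sim(v1, v2):
    # (v1 · v2) / (|v1| |v2|); 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_sim([1, 0], [1, 0]))  # same direction -> 1.0
print(cosine_sim([1, 0], [0, 1]))  # no shared terms -> 0.0
```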
Text Mining Tasks
Text Categorization: Architecture

[Diagram: training documents -> preprocessing -> weighting -> feature selection -> classifier, trained against the predefined categories; a new document d goes through the same pipeline and the classifier assigns a category (or categories) to d]
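A minimal sketch of this pipeline using a simple centroid classifier with cosine similarity. The slide does not prescribe a particular classifier, and the categories and training documents below are made up for illustration:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def centroid(docs):
    # summed term-frequency vector of a category's training documents
    c = Counter()
    for d in docs:
        c.update(tokenize(d))
    return c

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc, centroids):
    # assign the category whose centroid is most similar to the new document
    v = Counter(tokenize(doc))
    return max(centroids, key=lambda c: cosine(v, centroids[c]))

training = {
    "sports": ["the team won the match", "a great goal in the game"],
    "finance": ["the stock market fell", "shares and bonds rose today"],
}
centroids = {cat: centroid(docs) for cat, docs in training.items()}
print(classify("the team scored a goal", centroids))  # -> sports
```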
Document Clustering
Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.

Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Document Clustering Algorithms
- k-means
- Hierarchical Agglomerative Clustering (HAC)
- Association Rule Hypergraph Partitioning (ARHP)
- SOM / ESOM based clustering
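Of these, k-means is the simplest to sketch. A minimal Euclidean k-means over dense document vectors (the four 2-dimensional points below stand in for real document vectors):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # plain k-means: assign each point to its nearest center, then move each
    # center to the mean of its cluster; repeat for a fixed number of rounds
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

clusters = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
print(sorted(sorted(c) for c in clusters))
# -> [[[0, 0], [0, 1]], [[10, 10], [10, 11]]]
```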
Latent Semantic Indexing
- Weakness of keyword-based techniques
  - Lack of semantics
  - Cannot identify similar words/concepts without help
- Observation
  - Words/phrases that represent similar concepts are usually grouped together
  - The most important unit of information for documents may not be the word, but the concept
Latent Semantic Indexing
- Latent Semantic Indexing is an attempt to produce such information
- Start with the term-frequency matrix M
- Apply singular value decomposition to M
  - M = U * S * V^T
  - S = a diagonal matrix of singular values
  - U = a square matrix in which each entry captures similarity between documents
  - V = a square matrix in which each entry captures similarity between terms
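In practice the decomposition comes from a library SVD routine; to illustrate the idea without one, the sketch below uses power iteration on M^T * M to recover only the dominant singular value and right singular vector of a small matrix, i.e. a single LSI "concept". The matrix is made up for illustration:

```python
import math

def top_singular(M, iters=100):
    # Power iteration on A = M^T M: the dominant eigenvector of A is the top
    # right singular vector v of M, and its eigenvalue is s^2
    n = len(M[0])
    A = [[sum(M[k][i] * M[k][j] for k in range(len(M))) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the dominant eigenvalue of A (= s^2)
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return math.sqrt(lam), v

s, v = top_singular([[2, 0], [0, 1], [0, 0]])
print(round(s, 6))  # dominant singular value of this matrix -> 2.0
```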