Data Warehousing & Mining with Business Intelligence: Principles and Algorithms
Overview of Text Mining

Upload: victor-conley

Post on 17-Dec-2015


Page 1: Data Warehousing & Mining with Business Intelligence: Principles and Algorithms (Overview of Text Mining)

Page 2: Motivation

Text mining is well motivated: much of the world's data exists in free-text form (newspaper articles, emails, literature, etc.).

While mining free text shares the goals of data mining in general (extracting useful knowledge, statistics, and trends), text mining must overcome a major difficulty: there is no explicit structure.

Machines can reason well over relational data because schemas are explicitly available. Free text, however, encodes all semantic information within natural language.

Text mining algorithms must therefore make sense of this natural-language representation. Humans are good at this, but it has proved to be a hard problem for machines.

Page 3: Sources of Data

- Letters
- Emails
- Phone recordings
- Contracts
- Technical documents
- Patents
- Web pages
- Articles

Page 4: Text Mining

How does text mining relate to data mining in general? To computational linguistics? To information retrieval? One way to position the fields is by data type and by what is being found (patterns vs. "nuggets", novel vs. non-novel):

- Non-textual data: finding patterns = general data mining; finding novel nuggets = exploratory data analysis; finding non-novel nuggets = database queries
- Textual data: finding patterns = computational linguistics; finding non-novel nuggets = information retrieval

Page 5: Typical Applications

- Summarizing documents
- Discovering/monitoring relations among people, places, organizations, etc.
- Customer profile analysis
- Trend analysis
- Spam identification
- Public health early warning
- Event tracking

Page 6: Mining Text Data: An Introduction

Data mining / knowledge discovery operates over several kinds of data: structured data, multimedia, free text, and hypertext. The same fact can appear in each form:

Structured: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years)

Free text: "Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial."

Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>... Loans($200K, [map], ...)

Page 7: General NLP: Too Difficult!

- Word-level ambiguity: "design" can be a noun or a verb (ambiguous POS); "root" has multiple meanings (ambiguous sense)
- Syntactic ambiguity: "natural language processing" (modification); "A man saw a boy with a telescope." (PP attachment)
- Anaphora resolution: "John persuaded Bill to buy a TV for himself." (himself = John or Bill?)
- Presupposition: "He has quit smoking." implies that he smoked before

Humans rely on context to interpret (when possible). This context may extend beyond a given document!

Page 8: Text Databases and IR

Text databases (document databases):
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, and web pages
- Data stored is usually semi-structured
- Traditional IR techniques become inadequate for the increasingly vast amounts of text data

Information retrieval:
- A field developed in parallel with database systems
- Information is organized into (a large number of) documents
- The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Page 9: Information Retrieval

Typical IR systems:
- Online library catalogs
- Online document management systems

Information retrieval vs. database systems:
- Some DB problems are not present in IR, e.g., update, transaction management, complex objects
- Some IR problems are not addressed well in a DBMS, e.g., unstructured documents, approximate search using keywords and relevance

Page 10: Some "Basic" IR Techniques

- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine, KL-divergence)
- Relevance/pseudo feedback

Page 11: Information Retrieval Techniques: Basic Concepts

- A document can be described by a set of representative keywords called index terms.
- Different index terms have varying relevance when used to describe document contents.
- This effect is captured by assigning numerical weights to each index term of a document (e.g., frequency, tf-idf).

DBMS analogy: index terms correspond to attributes; weights correspond to attribute values.

Page 12: Generality of Basic Techniques

[Diagram: raw text is tokenized, passed through stemming and stop-word removal, then term weighting, producing a term-document weight matrix (rows d1..dm, columns t1..tn, entries w11..wmn). On top of this single representation, term similarity and document similarity support CLUSTERING, CATEGORIZATION, SUMMARIZATION (via sentence selection), and META-DATA/ANNOTATION.]

Page 13: Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Diagram: within All Documents, the Relevant and Retrieved sets overlap in the region "Relevant & Retrieved".]
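The two measures can be sketched directly over sets of document IDs; the IDs below are invented for illustration:

```python
# Precision and recall over sets of document IDs (hypothetical IDs).
relevant = {1, 2, 3, 4, 5}
retrieved = {3, 4, 5, 6}

hits = relevant & retrieved              # {Relevant} ∩ {Retrieved}
precision = len(hits) / len(retrieved)   # 3/4 = 0.75
recall = len(hits) / len(relevant)       # 3/5 = 0.6
print(precision, recall)
```

Note the trade-off: returning every document drives recall to 1 while precision collapses, which is why the two are always reported together.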

Page 14: Information Retrieval Techniques

Index term (attribute) selection:
- Stop list
- Word stem
- Index term weighting methods

Term-document frequency matrices.

Information retrieval models:
- Boolean model
- Vector model
- Probabilistic model

Page 15: Boolean Model

- Considers index terms to be either present or absent in a document; as a result, the index term weights are assumed to be binary.
- A query is composed of index terms linked by three connectives: not, and, or. E.g.: car and repair; plane or airplane.
- The Boolean model predicts that each document is either relevant or non-relevant based on the match of the document to the query.

Page 16: Keyword-Based Retrieval

- A document is represented by a string, which can be identified by a set of keywords.
- Queries may use expressions of keywords, e.g., car and repair shop; tea or coffee; DBMS but not Oracle.
- Queries and retrieval should consider synonyms, e.g., repair and maintenance.

Major difficulties of the model:
- Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g., data mining).
- Polysemy: the same keyword may mean different things in different contexts (e.g., mining).

Page 17: Similarity-Based Retrieval in Text Data

- Finds similar documents based on a set of common keywords.
- The answer should be based on the degree of relevance, considering the nearness of the keywords, relative frequency of the keywords, etc.

Basic techniques: stop list
- A set of words that are deemed "irrelevant" even though they may appear frequently, e.g., a, the, of, for, to, with, etc.
- Stop lists may vary when the document set varies.

Page 18: Similarity-Based Retrieval in Text Data

Word stem:
- Several words are small syntactic variants of each other since they share a common word stem, e.g., drug, drugs, drugged.

Term frequency table:
- Each entry frequency_table(i, j) = number of occurrences of word t_j in document d_i.
- Usually the ratio, rather than the absolute number of occurrences, is used.

Similarity metrics measure the closeness of a document to a query (a set of keywords):
- Relative term occurrences
- Cosine distance:

sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
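The cosine measure above can be sketched in a few lines of plain Python (vectors as lists; the example vectors are made up):

```python
import math

def cosine_sim(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_sim([1, 2, 0], [2, 4, 0]))
print(cosine_sim([1, 0], [0, 1]))
```

Because the dot product is divided by both norms, the measure depends only on the direction of the vectors, not their length, so long and short documents on the same topic score as similar.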

Page 19: Feature Extraction: Task (1)

Task: extract a good subset of words to represent documents.

Document collection → all unique words/phrases → feature extraction → all good words/phrases

Page 20: Feature Extraction: Task

While more and more textual information is available online, effective retrieval is difficult without good indexing of the text content.

[Diagram: feature extraction is the step that turns online text into a retrieval index, i.e., the core of text indexing tools.]

Page 21: Feature Extraction: Indexing

Pipeline over the training documents:
1. Identification of all unique words
2. Removal of stop words: non-informative words, e.g., {the, and, when, more}
3. Word stemming: removal of suffixes to generate word stems, grouping words and increasing relevance, e.g., {walker, walking} → walk
4. Term weighting: from naive terms to the importance of each term in a document
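The first three steps can be sketched end-to-end; the toy stop list and the crude suffix-stripping rule below are simplified stand-ins for a real stop list and a proper stemmer (e.g., Porter's algorithm):

```python
# Hypothetical mini-pipeline: unique words -> stop-word removal -> stemming.
STOP_WORDS = {"the", "and", "when", "more", "a", "of", "in"}  # toy stop list

def crude_stem(word):
    # Very rough suffix stripping, e.g. {walker, walking} -> walk.
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    words = {w.lower() for w in text.split()}          # 1. unique words
    words = {w for w in words if w not in STOP_WORDS}  # 2. stop-word removal
    return {crude_stem(w) for w in words}              # 3. stemming

print(index_terms("the walker and the walking"))
```

Note how stemming shrinks the vocabulary: "walker" and "walking" collapse to the single index term "walk", exactly the grouping effect the slide describes.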

Page 22: Feature Extraction: Weighting Model

tf (term frequency) weighting: w_ij = Freq_ij

Freq_ij := the number of times the jth term occurs in document D_i.

Drawback: does not reflect a term's importance for document discrimination.

Example:
D1 = ABRTSAQWAXAO
D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1

Page 23: Feature Extraction: Weighting Model

tf-idf (inverse document frequency) weighting: w_ij = Freq_ij * log(N / DocFreq_j)

N := the number of documents in the training document collection.
DocFreq_j := the number of documents in which the jth term occurs.

Advantage: reflects a term's importance for document discrimination.
Assumption: terms with a low DocFreq are better discriminators than terms with a high DocFreq in the document collection.

Example (continuing the D1/D2 collection above, N = 2, log base 10):

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
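The tf-idf weights in the example can be reproduced with a short script; it assumes log base 10, which matches the 0.3 ≈ log10(2) entries in the table:

```python
from collections import Counter
from math import log10

docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
N = len(docs)
tf = {name: Counter(text) for name, text in docs.items()}          # Freq_ij
doc_freq = Counter(t for counts in tf.values() for t in counts)    # DocFreq_j

def tfidf(doc, term):
    # w_ij = Freq_ij * log(N / DocFreq_j)
    return tf[doc][term] * log10(N / doc_freq[term])

print(round(tfidf("D1", "O"), 1))  # O occurs only in D1
print(round(tfidf("D2", "K"), 1))  # K occurs only in D2
print(tfidf("D1", "A"))            # A occurs in both documents
```

Terms appearing in every document (like A) get log(2/2) = 0 and are discarded as non-discriminative, however frequent they are; the rare terms O, W, and K receive all the weight.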

Page 24: Indexing Techniques

Inverted index:
- Maintains two hash- or B+-tree indexed tables:
  - document_table: a set of document records <doc_id, postings_list>
  - term_table: a set of term records <term, postings_list>
- Answering a query: find all docs associated with one or a set of terms
- (+) easy to implement
- (-) does not handle synonymy and polysemy well, and postings lists can be very long (storage can be very large)

Signature file:
- Associates a signature with each document
- A signature is a representation of an ordered list of terms that describe the document
- The order is obtained by frequency analysis, stemming, and stop lists
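A minimal in-memory sketch of the term_table side of an inverted index, with plain dicts standing in for the hash/B+-tree tables (the three-document corpus is invented):

```python
from collections import defaultdict

docs = {
    1: "car repair shop",
    2: "plane and airplane repair",
    3: "tea or coffee",
}

# term_table: term -> postings list of doc_ids containing the term.
term_table = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):
        term_table[term].append(doc_id)

def query(*terms):
    """Find all docs associated with a set of terms (conjunctive query)."""
    postings = [set(term_table.get(t, ())) for t in terms]
    return set.intersection(*postings) if postings else set()

print(sorted(query("repair")))
print(sorted(query("car", "repair")))
```

The postings-list intersection is exactly where the slide's storage concern bites: for a common term the list covers much of the collection, so real systems compress postings and intersect the shortest lists first.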

Page 25: Latent Semantic Indexing

- Similar documents have similar word frequencies.
- Difficulty: the term frequency matrix is very large.
- Use a singular value decomposition (SVD) technique to reduce the size of the frequency table.
- Retain the K most significant rows of the frequency table.
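The reduction step can be sketched with NumPy's SVD; the 4×3 term-document matrix below is made up, and K = 2 "concepts" are retained:

```python
import numpy as np

# Toy term-document frequency matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
], dtype=float)

K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-K approximation: keep only the K largest singular values.
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Documents are now compared in the K-dimensional concept space.
docs_k = np.diag(s[:K]) @ Vt[:K, :]   # shape: K x n_docs
print(docs_k.shape)
```

Queries are projected into the same K-dimensional space before matching, which is how LSI can retrieve a document sharing no literal keyword with the query, partially addressing the synonymy problem noted earlier.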

Page 26: Probabilistic Model

- Basic assumption: given a user query, there is a set of documents which contains exactly the relevant documents and no others (the ideal answer set).
- Querying is then a process of specifying the properties of the ideal answer set. Since these properties are not known at query time, an initial guess is made.
- This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents.
- An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set.

Page 27: Dimension Reduction: DocFreq Thresholding

Given the naive terms of the training documents D:
1. Calculate DocFreq(w) for each word w.
2. Set a threshold.
3. Remove all words with DocFreq below the threshold.
The surviving words are the feature terms.
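The three steps can be sketched directly; the threshold value and the toy training documents below are arbitrary:

```python
from collections import Counter

training_docs = [
    "data mining of text data",
    "text retrieval and text indexing",
    "mining frequent patterns",
]

# 1. Calculate DocFreq(w) over the training documents D.
doc_freq = Counter(w for doc in training_docs for w in set(doc.split()))

# 2. Set a threshold; 3. keep only words whose DocFreq reaches it.
threshold = 2
feature_terms = {w for w, df in doc_freq.items() if df >= threshold}
print(sorted(feature_terms))
```

Note the tension with the tf-idf assumption two slides back: tf-idf rewards rare terms, while DocFreq thresholding discards the rarest ones; in practice the very rare terms are often noise (typos, one-off names) and contribute little to classification.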

Page 28: Types of Text Data Mining

- Keyword-based association analysis
- Automatic document classification
- Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
- Link analysis: unusual correlation between entities
- Sequence analysis: predicting a recurring event
- Anomaly detection: finding information that violates usual patterns
- Hypertext analysis
  - Patterns in anchors/links
  - Anchor text correlations with linked objects

Page 29: Keyword-Based Association Analysis

Motivation: collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them.

Association analysis process:
- Preprocess the text data by parsing, stemming, removing stop words, etc.
- Invoke association mining algorithms: consider each document as a transaction, and view the set of keywords in the document as the set of items in the transaction.

Term-level association mining:
- No need for human effort in tagging documents
- The number of meaningless results and the execution time are greatly reduced

Page 30: Text Classification

Automatic classification of the large number of online text documents (web pages, e-mails, intranets, etc.).

Classification process:
1. Data preprocessing
2. Definition of training and test sets
3. Creation of the classification model using the selected classification algorithm
4. Classification model validation
5. Classification of new/unknown text documents

Text document classification differs from the classification of relational data: document databases are not structured according to attribute-value pairs.

Page 31: Text Classification (2)

Classification algorithms:
- Support Vector Machines
- k-NN
- Naive Bayes
- Neural networks
- Decision trees
- Association rule-based boosting

Page 32: Text Classification: An Example

Training set (class label: Hooligan):

Ex#  Text                                               Hooligan
1    An English football fan ...                        Yes
2    During a game in Italy ...                         Yes
3    England has been beating France ...                Yes
4    Italian football fans were cheering ...            No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

A classifier is learned from the training set and then applied to the test set:

Text                                                Hooligan
A Danish football fan ...                           ?
Turkey is playing vs. France. The Turkish fans ...  ?

Page 33: Document Clustering

Motivation:
- Automatically group related documents based on their contents
- No predetermined training sets or taxonomies; generate a taxonomy at runtime

Clustering process:
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities, applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)

Page 34: Document Clustering: k-means

0. Input: D := {d1, d2, ..., dn}; k := the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d from the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute the centroid
6. Until the centroids don't change
7. Output: k clusters of documents

Hierarchical clustering algorithms can similarly be extended to the text case.

Page 35: Text Categorization

- Pre-given categories and labeled document examples (the categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem

[Diagram: a categorization system routes incoming documents into categories such as Sports, Business, Education, Science, ...]

Page 36: Applications

- News article classification
- Automatic email filtering
- Webpage classification
- Word sense disambiguation
- ...

Page 37: Categorization: Architecture

[Diagram: training documents are preprocessed (term weighting, feature selection) against the predefined categories to build a classifier; a new document d is then passed to the classifier, which outputs the category (or categories) assigned to d.]

Page 38: Categorization Classifiers

- Centroid-based classifier
- k-nearest neighbor classifier
- Naive Bayes classifier

Page 39: Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms

Model: Centroid-Based Classifier
1. Input: new document d = (w1, w2, …, wn);
2. Predefined categories: C = {c1, c2, …, cl};
3. Compute the centroid vector of each category:
   c⃗i = ( Σ d′∈ci d′ ) / |ci|,  for each ci ∈ C
4. Similarity model, the cosine function:
   Simil(di, dj) = cos(di, dj) = (di · dj) / (‖di‖2 × ‖dj‖2) = Σl wil × wjl / ( √(Σl wil²) × √(Σl wjl²) )
5. Compute the similarity Simil(c⃗i, d) = cos(c⃗i, d) for each ci ∈ C;
6. Output: assign to document d the category cmax such that Simil(c⃗i, d) ≤ Simil(cmax, d) for all ci ∈ C.
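The centroid-based classifier above can be sketched in a few lines (a minimal illustration, not from the slides; sparse term vectors are plain Python dicts and all training data is hypothetical):

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    # Step 3: average the vectors of all training docs in one category.
    c = {}
    for d in docs:
        for t, w in d.items():
            c[t] = c.get(t, 0.0) + w
    return {t: w / len(docs) for t, w in c.items()}

def classify(d, training):
    # training: {category: [doc vectors]}. Steps 5-6: pick the category
    # whose centroid is most similar (cosine) to the new document d.
    centroids = {c: centroid(ds) for c, ds in training.items()}
    return max(centroids, key=lambda c: cosine(centroids[c], d))

training = {
    "sports":   [{"game": 2.0, "team": 1.5}, {"team": 2.0, "score": 1.0}],
    "business": [{"market": 2.0, "stock": 1.0}, {"stock": 2.0, "profit": 1.0}],
}
print(classify({"team": 1.0, "game": 1.0}, training))  # → sports
```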

Page 40

Model: k-Nearest Neighbor Classifier
1. Input: new document d;
2. Training collection: D = {d1, d2, …, dn};
3. Predefined categories: C = {c1, c2, …, cl};
4. Compute similarities: for (di ∈ D) { Simil(d, di) = cos(d, di); }
5. Select the k nearest neighbors: construct the k-document subset Dk so that
   Simil(d, di) < min( Simil(d, doc) | doc ∈ Dk )  for all di ∈ D - Dk.
6. Compute a score for each category:
   for (ci ∈ C) { score(ci) = 0;
     for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) = true ? 1 : 0); } }
7. Output: assign to d the category c with the highest score.
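The kNN procedure above can be sketched as follows (a minimal illustration with hypothetical data; ranking by similarity replaces the set construction of step 5):

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(d, labeled_docs, k=3):
    # labeled_docs: list of (vector, category) pairs.
    # Steps 4-5: rank training docs by similarity to d, keep the k nearest.
    ranked = sorted(labeled_docs, key=lambda dc: cosine(d, dc[0]), reverse=True)
    # Step 6: each of the k neighbors votes for its own category.
    votes = Counter(cat for _, cat in ranked[:k])
    # Step 7: output the category with the highest score.
    return votes.most_common(1)[0][0]

docs = [
    ({"game": 1.0, "team": 1.0}, "sports"),
    ({"team": 1.0, "score": 1.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "business"),
]
print(knn_classify({"team": 1.0}, docs, k=3))  # → sports
```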

Page 41

Categorization Methods
Manual: typically rule-based
  Does not scale up (labor-intensive, rule inconsistency)
  May be appropriate for special data in a particular domain
Automatic: typically exploiting machine learning techniques
  Vector space model based
    Prototype-based (Rocchio)
    k-nearest neighbor (kNN)
    Decision tree (learns rules)
    Neural networks (learn a non-linear classifier)
    Support Vector Machines (SVM)
  Probabilistic or generative model based
    Naïve Bayes classifier


Page 42


Vector Space Model
Represent a doc by a term vector
  Term: a basic concept, e.g., a word or phrase
  Each term defines one dimension
  N terms define an N-dimensional space
  Each element of the vector corresponds to a term weight
  E.g., d = (x1, …, xN), where xi is the “importance” of term i
A new document is assigned to the most likely category based on vector similarity.

Page 43


VS Model: Illustration
[Illustration: documents plotted in a term space with dimensions “Java”, “Microsoft”, and “Starbucks”; clusters C1 (Category 1) and C3 (Category 3) are shown, and a new doc is assigned to the nearest category.]

Page 44


How to Assign Weights
Two-fold heuristics based on frequency
  TF (term frequency): more frequent within a document ⇒ more relevant to its semantics
    e.g., “query” vs. “commercial”
  IDF (inverse document frequency): less frequent among documents ⇒ more discriminative
    e.g., “algebra” vs. “science”

Page 45


TF Weighting
Weighting: more frequent ⇒ more relevant to the topic
  e.g., “query” vs. “commercial”
  Raw TF = f(t, d): how many times term t appears in doc d
Normalization: document length varies ⇒ relative frequency preferred
  e.g., maximum frequency normalization
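Maximum frequency normalization divides each raw count by the document's largest count, so the most frequent term gets TF = 1.0 regardless of document length. A minimal sketch (the tokens are hypothetical):

```python
def tf_normalized(doc_tokens):
    # Raw TF: count occurrences of each term in the document.
    counts = {}
    for t in doc_tokens:
        counts[t] = counts.get(t, 0) + 1
    # Maximum frequency normalization: divide by the largest raw count.
    max_f = max(counts.values())
    return {t: f / max_f for t, f in counts.items()}

tf = tf_normalized(["query", "query", "results", "query", "page"])
print(tf["query"])   # most frequent term gets 1.0
print(tf["results"])
```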

Page 46


IDF Weighting
Idea: less frequent among documents ⇒ more discriminative
Formula (in its common form): IDF(t) = log(n / k), where
  n = total number of docs
  k = # docs in which term t appears (the DF, document frequency)
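Assuming the common log(n/k) form, the IDF computation is one line (the counts below are hypothetical):

```python
import math

def idf(n_docs, df):
    # IDF(t) = log(n / k): n = total docs, k = docs containing term t.
    return math.log(n_docs / df)

# A rare term is more discriminative than a ubiquitous one:
print(idf(1000, 10))   # rare term, high IDF
print(idf(1000, 900))  # common term, IDF near 0
```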

Page 47


TF-IDF Weighting
TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  Frequent within a doc ⇒ high TF ⇒ high weight
  Selective among docs ⇒ high IDF ⇒ high weight
Recall the VS model
  Each selected term represents one dimension
  Each doc is represented by a feature vector
  The t-term coordinate of document d is the TF-IDF weight
This is more reasonable
Just for illustration …
  Many more complex and more effective weighting variants exist in practice
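Putting TF and IDF together, a small sketch of building one TF-IDF vector per document (assuming maximum-frequency-normalized TF and the log(n/k) IDF; the corpus is hypothetical):

```python
import math

def tf_idf_vectors(corpus):
    # corpus: list of tokenized documents. Returns one TF-IDF weight
    # vector per document: weight(t, d) = TF(t, d) * IDF(t).
    n = len(corpus)
    df = {}
    for doc in corpus:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for doc in corpus:
        counts = {t: doc.count(t) for t in set(doc)}
        max_f = max(counts.values())
        vectors.append({t: (f / max_f) * math.log(n / df[t])
                        for t, f in counts.items()})
    return vectors

vecs = tf_idf_vectors([
    ["text", "mining", "text"],
    ["text", "travel", "map"],
    ["travel", "map", "map"],
])
# "text" appears in 2 of 3 docs while "mining" appears in only 1, so
# "mining" gets the larger weight in doc 0 despite "text" being more frequent.
print(vecs[0])
```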

Page 48


How to Measure Similarity?
Given two documents d1 and d2
Similarity definitions
  dot product: Simil(d1, d2) = d1 · d2 = Σl w1l × w2l
  normalized dot product (or cosine): Simil(d1, d2) = (d1 · d2) / (‖d1‖ × ‖d2‖)

Page 49


Illustrative Example

Faked IDF values:
  text 2.4 | mining 4.5 | travel 2.8 | map 3.3 | search 2.1 | engine 5.4 | govern 2.2 | president 3.2 | congress 4.3

TF counts, with TF×IDF weights in parentheses:
  doc1:   text 2 (4.8), mining 1 (4.5), search 1 (2.1), engine 1 (5.4)
  doc2:   text 1 (2.4), travel 2 (5.6), map 1 (3.3)
  doc3:   govern 1 (2.2), president 1 (3.2), congress 1 (4.3)
  newdoc: text 1 (2.4), mining 1 (4.5)

To whom is newdoc more similar?
  Sim(newdoc, doc1) = 4.8×2.4 + 4.5×4.5 = 31.77
  Sim(newdoc, doc2) = 2.4×2.4 = 5.76
  Sim(newdoc, doc3) = 0
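The similarities in this example can be reproduced with a few lines of code (dot-product similarity over the TF-IDF weights from the slide):

```python
# TF-IDF weight vectors taken from the slide's table (term: TF × faked IDF).
doc1   = {"text": 4.8, "mining": 4.5, "search": 2.1, "engine": 5.4}
doc2   = {"text": 2.4, "travel": 5.6, "map": 3.3}
doc3   = {"govern": 2.2, "president": 3.2, "congress": 4.3}
newdoc = {"text": 2.4, "mining": 4.5}

def dot(u, v):
    # Unnormalized dot-product similarity over shared terms.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, dot(newdoc, d))
# doc1 ≈ 31.77, doc2 ≈ 5.76, doc3 = 0: newdoc is most similar to doc1.
```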

Page 50


Probabilistic Model
Category C is modeled as a probability distribution over pre-defined random events
Random events model the process of generating documents
Therefore, how likely a document d belongs to category C is measured through the probability for category C to generate d.
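The Naïve Bayes classifier is the standard instance of this generative view: each category defines a word distribution, and d is assigned to the category most likely to have generated it. A minimal multinomial sketch with Laplace smoothing (all training data hypothetical):

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    # labeled_docs: list of (tokens, category). Estimate P(C) and the
    # per-category word counts needed for smoothed P(w|C).
    cat_docs, cat_words, vocab = Counter(), {}, set()
    for tokens, c in labeled_docs:
        cat_docs[c] += 1
        cat_words.setdefault(c, Counter()).update(tokens)
        vocab.update(tokens)
    return cat_docs, cat_words, vocab, len(labeled_docs)

def classify_nb(tokens, model):
    cat_docs, cat_words, vocab, n = model
    def log_prob(c):
        # log P(C) + Σ_w log P(w|C), with add-one (Laplace) smoothing.
        total = sum(cat_words[c].values())
        lp = math.log(cat_docs[c] / n)
        for w in tokens:
            lp += math.log((cat_words[c][w] + 1) / (total + len(vocab)))
        return lp
    return max(cat_docs, key=log_prob)

model = train_nb([
    (["game", "team", "win"], "sports"),
    (["team", "score"], "sports"),
    (["stock", "market", "profit"], "business"),
])
print(classify_nb(["team", "game"], model))  # → sports
```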

Page 51


Evaluations
Effectiveness measures
  Precision
  Recall
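For a single category, precision is TP / (TP + FP) and recall is TP / (TP + FN), computed from the classifier's assignments versus the true labels. A small hypothetical sketch:

```python
def precision_recall(predicted, actual):
    # predicted / actual: sets of document ids assigned to / truly in a category.
    tp = len(predicted & actual)                           # true positives
    precision = tp / len(predicted) if predicted else 0.0  # TP / (TP + FP)
    recall = tp / len(actual) if actual else 0.0           # TP / (TP + FN)
    return precision, recall

# 4 docs assigned to the category, 3 truly belong, 2 of them overlap:
p, r = precision_recall(predicted={1, 2, 3, 4}, actual={2, 3, 5})
print(p, r)  # precision 0.5, recall ≈ 0.67
```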

Page 52


Evaluation (con’t)
Benchmarks
  Classic: the Reuters collection
    A set of newswire stories classified under categories related to economics.
Effectiveness
  Difficulties of strict comparison
    different parameter settings
    different “splits” (or selections) between training and testing
    various optimizations …
  However, some results are widely recognized
    Best: boosting-based committee classifiers & SVM
    Worst: Naïve Bayes classifier
Need to consider other factors, especially efficiency

Page 53


Summary: Text Categorization
Wide application domain
Comparable effectiveness to professionals
  Manual TC is not 100% accurate and is unlikely to improve substantially.
  Automatic TC is growing at a steady pace
Prospects and extensions
  Very noisy text, such as text from O.C.R.
  Speech transcripts