Business Intelligence & Data Mining-11
TRANSCRIPT
-
8/10/2019 Business Intelligence & Data Mining-11
1/37
Text Mining
Motivation for Text Mining
- Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation)
- Information-intensive business processes demand that we transcend from data mining to knowledge discovery in text.

[Chart: 10% structured (numerical or coded) information vs. 90% unstructured or semi-structured information]
Text Databases
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases
- Properties
  - Unstructured in general (some semi-structured, e.g. XML)
  - Semantics, not only syntax, is important
  - Non-numeric in nature
Challenges of Text Mining
- Very high number of possible dimensions
  - All possible word and phrase types in the language!
- Unlike data mining:
  - records (= docs) are not structurally identical
  - records are not statistically independent
- Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
- Ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)
Technological Advances in Text Mining
- Advances in text processing technology
  - Natural Language Processing (NLP)
  - New algorithms
- Cheap hardware!
  - CPU
  - Disk
  - Network
Data Mining vs. Text Mining & Search vs. Discover

                          Structured Data    Unstructured Data (Text)
Search (goal-oriented)    Data Retrieval     Document Retrieval
Discover                  Data Mining        Text Mining
Text Databases and Information Retrieval
- Information / document retrieval
  - Traditional study of how to retrieve information from text documents
  - Information is organized into (a large number of) documents
- Information retrieval problem: locating relevant documents based on user input, such as keywords, queries or example documents
Text Databases and Information Retrieval
- Typical IR systems
  - Online library catalogs
  - Online document management systems
- Information retrieval vs. database systems
  - Some DB problems are not present in IR, e.g., update, transaction management, complex objects
  - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
What is Text Mining?
Text Mining is the process of synthesizing information by analyzing the relations, patterns, and rules within textual data: semi-structured or unstructured text.

The Tool Box
- Data mining algorithms
- Machine learning techniques
- Document / information retrieval concepts
- Statistical techniques
- Natural-language processing
Potential Applications
- Customer comment analysis
- Trend analysis
- Information filtering and routing
- Event tracking
- News story classification
- Web search
- Sentiment analysis
Basic Measures: Precision & Recall
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
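These definitions translate directly into code. A minimal sketch, treating Relevant and Retrieved as sets of document IDs (the IDs below are made up for illustration):

```python
def precision_recall(relevant, retrieved):
    # |{Relevant} ∩ {Retrieved}| is the set of correct responses
    hits = relevant & retrieved
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

# 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs were retrieved
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(p, r)
```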
Precision decreases as Recall increases

[Figure: precision-recall curves comparing SVM, Decision Tree, SOM, and Logistic Regression classifiers; precision on one axis, recall on the other. Source: Dumais (1998)]
F-measure
- F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier)

F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)

- Useful as a simple single summary measure of performance
- Sometimes the breakeven F1 is used, i.e., F1 measured when P = R
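A quick sketch of the F1 computation from the formula above (P and R are assumed to come from a precision/recall evaluation):

```python
def f1(p, r):
    # harmonic mean of precision and recall; 0.0 when both are zero
    return 2 * p * r / (p + r) if p + r else 0.0

print(f1(0.5, 0.5))  # breakeven case: F1 equals P = R, so 0.5
```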
Challenges in text mining
- Semantics:
  - Synonymy: the keyword T does not appear anywhere in the document d, even though the document d is closely related to T (i.e., synonyms of T have been used in d)
  - Polysemy: the same keyword may mean different things in different contexts, e.g., green (colour) vs. green initiatives
Challenges in text mining
- Data pre-processing
  - Stop list: set of words that are deemed irrelevant, even though they may appear frequently
    - E.g., a, the, of, for, with, etc.
    - Stop lists may vary when the document set varies
  - Word stem: several words are small syntactic variants of each other since they share a common word stem
    - E.g., drug, drugs, drugged
  - Phrases: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    - E.g., data mining, decision support
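A toy pre-processing sketch covering the first two steps: tokenize, drop stop words, and collapse simple plural variants. The stop list and the trailing-"s" rule here are illustrative placeholders, not a production stop list or a real stemmer:

```python
# Illustrative stop list only; real stop lists are much longer and task-specific
STOP_WORDS = {"a", "the", "of", "for", "with"}

def preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # naive stemming: strip a trailing "s" (real systems use Porter or WordNet)
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The effects of drugs for the heart"))
# -> ['effect', 'drug', 'heart']
```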
Feature Extraction
Feature Extraction
Task: extract a good subset of words / phrases to represent documents

[Diagram: document collection -> all unique words/phrases -> feature extraction -> all good words/phrases]
Stemming
- Want to reduce all morphological variants of a word to a single index term
  - e.g. a document containing words like "fish" and "fisher" may not be retrieved by a query containing "fishing" (the word "fishing" is not explicitly contained in the document)
- Stemming: reduce words to their root form
  - e.g. "fish" becomes a new index term
- Porter stemming algorithm (1980)
  - relies on a preconstructed suffix list with associated rules
    - e.g. if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION => BINARIZE
  - Not always desirable: e.g., {university, universal} -> univers (in Porter's)
- WordNet: dictionary-based approach
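A minimal sketch of a Porter-style suffix rule. This implements only the single IZATION -> IZE rule quoted above, not the full Porter algorithm:

```python
def has_vowel_consonant(prefix):
    # the rule's condition: prefix contains a vowel followed by a consonant
    vowels = "aeiou"
    return any(prefix[i] in vowels and prefix[i + 1] not in vowels
               for i in range(len(prefix) - 1))

def stem_ization(word):
    w = word.lower()
    if w.endswith("ization") and has_vowel_consonant(w[:-7]):
        return w[:-7] + "ize"
    return w

print(stem_ization("BINARIZATION"))  # -> binarize
```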
Document Representation
Data Representation
- Document vector / frequency matrix / bag of words (BOW)
  - Each document is represented by a vector
  - Each dimension of the vector is associated with a word/term
  - For each document, the value of each dimension is the frequency with which that word occurs in the document.
Term Frequency
tf - term frequency weighting

w_ij = Freq_ij = the number of times the jth term occurs in document D_i.

Drawback: does not reflect the importance of the term for document discrimination.

Ex.
D1: ABRTSAQWAXAO
D2: RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   3  1  0  1  1  1  1  1  1  1
D2   3  2  1  0  1  1  1  1  0  1
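The same construction in code: a minimal bag-of-words sketch, using made-up word documents rather than the letter strings above:

```python
from collections import Counter

# Two illustrative documents (not from the slides)
docs = {"D1": "data mining finds patterns in data",
        "D2": "text mining mines text"}

# vocabulary = all unique terms across the collection, in sorted order
vocab = sorted({t for d in docs.values() for t in d.split()})

# per-document term-frequency counts, then a row per document over vocab
tf = {name: Counter(text.split()) for name, text in docs.items()}
matrix = {name: [tf[name][term] for term in vocab] for name in docs}

print(vocab)
print(matrix)
```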
Example of a document-term matrix
database SQL index regression likelihood linear
d1 24 21 9 0 0 3
d2 32 10 5 0 3 0
d3 12 16 5 0 0 0
d4 6 7 2 0 0 0
d5 43 31 20 0 3 0
d6 2 0 0 18 7 16
d7 0 0 1 32 12 0
d8 3 0 0 22 4 2
d9 1 0 0 34 27 25
d10 6 0 0 17 4 23
TF-IDF
tf-idf - term frequency * inverse document frequency

w_ij = Freq_ij * log(N / DocFreq_j)

N = the number of documents in the training document collection.
DocFreq_j = the number of documents in the training collection where the jth term occurs.

Advantage: reflects the importance of a term for document discrimination.
Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.

Ex.
     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
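A sketch of this weighting, using the D1/D2 term-frequency counts from the previous slide and log base 10 (which reproduces the 0.3 values above, since log10(2) ≈ 0.3):

```python
import math

# term-frequency counts taken from the slide's D1/D2 matrix (zeros omitted)
freq = {"D1": {"A": 3, "B": 1, "O": 1, "Q": 1, "R": 1, "S": 1, "T": 1, "W": 1, "X": 1},
        "D2": {"A": 3, "B": 2, "K": 1, "Q": 1, "R": 1, "S": 1, "T": 1, "X": 1}}
N = len(freq)

def doc_freq(term):
    # number of documents in which the term occurs
    return sum(1 for d in freq.values() if term in d)

def tfidf(doc, term):
    # w_ij = Freq_ij * log10(N / DocFreq_j)
    f = freq[doc].get(term, 0)
    return f * math.log10(N / doc_freq(term)) if f else 0.0

print(round(tfidf("D1", "O"), 1))  # term only in D1: 1 * log10(2) -> 0.3
print(round(tfidf("D1", "A"), 1))  # term in both docs: log10(1) = 0 -> 0.0
```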
Term Entropy
w_ij = log(1.0 + Freq_ij) * (1 + entropy(w_j))

where

entropy(w_j) = (1 / log N) * sum_{i=1..N} (Freq_ij / DocFreq_j) * log(Freq_ij / DocFreq_j)

is the average entropy of the jth term over the N documents; it evaluates to:
-1: if the word occurs once in every document
0: if the word occurs once in only one document
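The two boundary cases quoted above can be checked directly. A small sketch, where freqs holds the term's per-document frequencies across a collection of N documents:

```python
import math

def term_entropy(freqs):
    # entropy(w_j) = (1/log N) * sum_i (f/df) * log(f/df), summing over the
    # documents where the term occurs; df is the term's document frequency
    N = len(freqs)
    df = sum(1 for f in freqs if f > 0)
    s = sum((f / df) * math.log(f / df) for f in freqs if f > 0)
    return s / math.log(N)

print(term_entropy([1, 1, 1, 1]))  # once in every document -> -1 (up to rounding)
print(term_entropy([1, 0, 0, 0]))  # once in a single document -> 0.0
```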
Similarity Measures
Similarity measures
- For various tasks, we need a measure of similarity between documents
- Euclidean distance is a common measure
- Cosine similarity or dot-product is another measure used in text mining:
  - Focuses on co-occurrence of words
  - This corresponds to the cosine of the angle between the two vectors

sim(v1, v2) = (v1 · v2) / (|v1| |v2|)
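The formula above as code, for two term-frequency vectors of equal length:

```python
import math

def cosine_sim(v1, v2):
    # (v1 · v2) / (|v1| |v2|); 0.0 if either vector is all zeros
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_sim([1, 0], [1, 0]))  # same direction -> 1.0
print(cosine_sim([1, 0], [0, 1]))  # no shared terms -> 0.0
```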
Text Mining Tasks
Text Categorization: Architecture

[Diagram: training documents -> preprocessing -> weighting -> feature selection -> classifier, trained against the predefined categories; a new document d goes through the same pipeline and the classifier assigns a category (or categories) to d]
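A minimal sketch of this pipeline using a simple centroid classifier with cosine similarity. The slide does not prescribe a particular classifier, and the categories and training documents below are made up for illustration:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def centroid(docs):
    # summed term-frequency vector of a category's training documents
    c = Counter()
    for d in docs:
        c.update(tokenize(d))
    return c

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc, centroids):
    # assign the category whose centroid is most similar to the new document
    v = Counter(tokenize(doc))
    return max(centroids, key=lambda c: cosine(v, centroids[c]))

training = {
    "sports": ["the team won the match", "a great goal in the game"],
    "finance": ["the stock market fell", "shares and bonds rose today"],
}
centroids = {cat: centroid(docs) for cat, docs in training.items()}
print(classify("the team scored a goal", centroids))  # -> sports
```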
Document Clustering
Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.

Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.
Document Clustering Algorithms
- k-means
- Hierarchical Agglomerative Clustering (HAC)
- Association Rule Hypergraph Partitioning (ARHP)
- SOM / ESOM based clustering
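Of these, k-means is the simplest to sketch. A minimal Euclidean k-means over dense document vectors (the four 2-dimensional points below stand in for real document vectors):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # plain k-means: assign each point to its nearest center, then move each
    # center to the mean of its cluster; repeat for a fixed number of rounds
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters

clusters = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
print(sorted(sorted(c) for c in clusters))
# -> [[[0, 0], [0, 1]], [[10, 10], [10, 11]]]
```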
Latent Semantic Indexing
- Weakness of keyword-based techniques
  - Lack of semantics
  - Cannot identify similar words/concepts without help
- Observation
  - Words/phrases that represent similar concepts are usually grouped together
  - The most important unit of information for documents may not be the word, but the concept
Latent Semantic Indexing
- Latent Semantic Indexing is an attempt to produce such information
- Start with the term-frequency matrix M
- Apply singular value decomposition to M
  - M = U * S * V^T
  - S = a diagonal matrix of singular values
  - U = a square matrix in which each entry captures similarity between documents
  - V = a square matrix in which each entry captures similarity between terms
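In practice the decomposition comes from a library SVD routine; to illustrate the idea without one, the sketch below uses power iteration on M^T * M to recover only the dominant singular value and right singular vector of a small matrix, i.e. a single LSI "concept". The matrix is made up for illustration:

```python
import math

def top_singular(M, iters=100):
    # Power iteration on A = M^T M: the dominant eigenvector of A is the top
    # right singular vector v of M, and its eigenvalue is s^2
    n = len(M[0])
    A = [[sum(M[k][i] * M[k][j] for k in range(len(M))) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the dominant eigenvalue of A (= s^2)
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return math.sqrt(lam), v

s, v = top_singular([[2, 0], [0, 1], [0, 0]])
print(round(s, 6))  # dominant singular value of this matrix -> 2.0
```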