Business Intelligence & Data Mining-11

Upload: binzidd007

Post on 02-Jun-2018


  • 8/10/2019 Business Intelligence & Data Mining-11


    Text Mining


    Motivation for Text Mining

- Approximately 90% of the world's data is held in unstructured formats (source: Oracle Corporation)
- Information-intensive business processes demand that we transcend from data mining to knowledge discovery in text.

[Chart: roughly 90% unstructured or semi-structured information vs. 10% structured numerical or coded information]


    Text Databases

- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases
- Properties
  - Unstructured in general (some semi-structured, e.g. XML)
  - Semantics, not only syntax, is important
  - Non-numeric in nature


    Challenges of Text Mining

- Very high number of possible dimensions
  - All possible word and phrase types in the language!
- Unlike data mining:
  - records (= docs) are not structurally identical
  - records are not statistically independent
- Complex and subtle relationships between concepts in text
  - "AOL merges with Time-Warner"
  - "Time-Warner is bought by AOL"
- Ambiguity and context sensitivity
  - automobile = car = vehicle = Toyota
  - Apple (the company) or apple (the fruit)


    Technological Advances in Text Mining

- Advances in text processing technology
  - Natural Language Processing (NLP)
  - New algorithms
- Cheap hardware!
  - CPU
  - Disk
  - Network


Data Mining vs. Text Mining & Search vs. Discover

                          Structured Data    Unstructured Data (Text)
  Search (goal-oriented)  Data Retrieval     Document Retrieval
  Discover                Data Mining        Text Mining


Text Databases and Information Retrieval

- Information / document retrieval
  - Traditional study of how to retrieve information from text documents
  - Information is organized into (a large number of) documents
- Information retrieval problem: locating relevant documents based on user input, such as keywords, queries or example documents


    Text Database andInformation Retrieval

    n Typical IR systemsn Online library catalogs

    n Online document management systems

    n

    Information retrieval vs. database systemsn Some DB problems are not present in IR, e.g.,

    update, transaction management, complexobjects

    n Some IR problems are not addressed well inDBMS, e.g., unstructured documents, approximatesearch using keywords and relevance


    What is Text Mining?

Text Mining is the process of synthesizing information by analyzing the relations, patterns, and rules within textual data: semi-structured or unstructured text.

The Tool Box
- Data mining algorithms
- Machine learning techniques
- Document / information retrieval concepts
- Statistical techniques
- Natural-language processing


    Potential Applications

    Customer comment analysis

    Trend analysis

    Information filtering and routing

    Event tracking

    News story classification

    Web search

    Sentiment analysis



    Basic Measures: Precision & Recall

- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
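The two set formulas above translate directly into code. A minimal sketch; the document IDs in the example are hypothetical, not from the slides:

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(relevant) & set(retrieved)) / len(set(retrieved))

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(relevant) & set(retrieved)) / len(set(relevant))

# Hypothetical example: 4 relevant docs, 5 retrieved, 2 in the overlap.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5", "d6", "d7"}
# precision = 2/5, recall = 2/4
```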


Precision decreases as Recall increases

[Figure: precision vs. recall curves for SVM, Decision Tree, SOM, and Logistic Regression classifiers; source: Dumais (1998)]


    F-measure

- F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier)

  F1 = 1 / (0.5/P + 0.5/R) = 2PR / (P + R)

- Useful as a simple single summary measure of performance
- Sometimes the breakeven F1 is used, i.e., F1 measured when P = R
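The F1 formula above is a one-liner; a minimal sketch, with the degenerate P = R = 0 case handled explicitly:

```python
def f1(p, r):
    """F1 = 2PR / (P + R), the harmonic mean of precision and recall."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)
```

Note that at the breakeven point (P = R) the harmonic mean equals the common value, so F1 = P = R.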


    Challenges in text mining

- Semantics:
  - Synonymy: the keyword T does not appear anywhere in the document d, even though d is closely related to T (i.e. synonyms of T have been used in d)
  - Polysemy: the same keyword may mean different things in different contexts, e.g., green (colour) vs. green initiatives


    Challenges in text mining

- Data pre-processing
  - Stop list: set of words that are deemed irrelevant, even though they may appear frequently
    - E.g., a, the, of, for, with, etc.
    - Stop lists may vary when the document set varies
  - Word stem: several words are small syntactic variants of each other since they share a common word stem
    - E.g., drug, drugs, drugged
  - Phrases: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    - E.g., data mining, decision support
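A minimal pre-processing sketch covering the stop-list step: lowercase, tokenize, drop stop words. The stop list here is a toy subset for illustration; stemming and phrase detection (next slides) would follow this step:

```python
import re

# Toy stop list; real stop lists are longer and vary with the document set.
STOP_LIST = {"a", "the", "of", "for", "with", "and", "to", "in"}

def preprocess(text, stop_list=STOP_LIST):
    """Lowercase, tokenize on alphabetic runs, and drop stop-list words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_list]

# "drug" and "drugs" survive as distinct tokens; merging them is stemming's job.
print(preprocess("The importance of data mining for decision support"))
```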


    Feature Extraction


Feature Extraction

Task: extract a good subset of words / phrases to represent documents

Document collection -> All unique words/phrases -> Feature Extraction -> All good words/phrases


Stemming

- Want to reduce all morphological variants of a word to a single index term
  - E.g. a document containing words like "fish" and "fisher" may not be retrieved by a query containing "fishing" (the word "fishing" is not explicitly contained in the document)
- Stemming: reduce words to their root form
  - E.g. "fish" becomes a new index term
- Porter stemming algorithm (1980)
  - Relies on a preconstructed suffix list with associated rules
  - E.g. if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
    - BINARIZATION => BINARIZE
  - Not always desirable: e.g., {university, universal} -> univers (in Porter's)
- WordNet: dictionary-based approach
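The single Porter rule quoted on the slide (IZATION -> IZE) can be sketched on its own. This is not the full Porter algorithm, which applies many rule groups and a "measure" condition; it only implements the slide's simplified condition (stem contains a vowel followed by a consonant):

```python
VOWELS = "aeiou"

def stem_ization(word):
    """Apply one Porter-style rule: IZATION -> IZE when the remaining
    stem contains at least one vowel immediately followed by a consonant."""
    suffix = "ization"
    w = word.lower()
    if w.endswith(suffix):
        stem = w[: -len(suffix)]
        if any(stem[i] in VOWELS and stem[i + 1] not in VOWELS
               for i in range(len(stem) - 1)):
            return stem + "ize"
    return w

print(stem_ization("BINARIZATION"))  # the slide's example
```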


    Document Representation


    Data Representation

- Document vector / frequency matrix / bag of words (BOW)
  - Each document is represented by a vector
  - Each dimension of the vector is associated with a word/term
  - For each document, the value of each dimension is the frequency of that word in the document


    Term Frequency

tf - Term Frequency weighting

w_ij = Freq_ij = the number of times the jth term occurs in document D_i.

Drawback: does not reflect the importance of the term for document discrimination.

Ex. D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1
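Building a term-frequency (bag-of-words) matrix like the one above is straightforward; a minimal sketch using a shared sorted vocabulary, shown on a hypothetical two-document example rather than the slide's letter strings:

```python
from collections import Counter

def term_frequency_matrix(docs):
    """docs: list of token lists. Returns the sorted vocabulary and
    one row of raw term counts per document (bag of words)."""
    vocab = sorted({t for doc in docs for t in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = term_frequency_matrix([["data", "mining", "data"],
                                     ["text", "mining"]])
```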


    Example of a document-term matrix

      database  SQL  index  regression  likelihood  linear
d1          24   21      9           0           0       3
d2          32   10      5           0           3       0
d3          12   16      5           0           0       0
d4           6    7      2           0           0       0
d5          43   31     20           0           3       0
d6           2    0      0          18           7      16
d7           0    0      1          32          12       0
d8           3    0      0          22           4       2
d9           1    0      0          34          27      25
d10          6    0      0          17           4      23


    TF-IDF

tf-idf - Term Frequency * Inverse Document Frequency

w_ij = Freq_ij * log(N / DocFreq_j)

N = the number of documents in the training document collection.
DocFreq_j = the number of documents in the training collection in which the jth term occurs.

Advantage: reflects an importance factor for document discrimination.
Assumption: terms with low DocFreq are better discriminators than ones with high DocFreq in the document collection.

Ex. (D1, D2 from the previous slide; with N = 2, terms in both documents get log(2/2) = 0, and a term unique to one document gets 1 * log10(2/1) ≈ 0.3)

     A  B    K    O  Q  R  S  T    W  X
D1   0  0    0  0.3  0  0  0  0  0.3  0
D2   0  0  0.3    0  0  0  0  0    0  0
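A minimal tf-idf sketch over a raw frequency matrix. The slide's 0.3 values imply a base-10 logarithm, so that is assumed here; other bases only rescale the weights:

```python
import math

def tf_idf(freq_matrix):
    """w_ij = Freq_ij * log10(N / DocFreq_j) over a list-of-rows matrix."""
    n_docs = len(freq_matrix)
    n_terms = len(freq_matrix[0])
    doc_freq = [sum(1 for row in freq_matrix if row[j] > 0)
                for j in range(n_terms)]
    return [[row[j] * math.log10(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
             for j in range(n_terms)]
            for row in freq_matrix]

# Two docs, three terms: the shared first term gets weight 0,
# the terms unique to one document get freq * log10(2) each.
weights = tf_idf([[3, 1, 0],
                  [3, 0, 1]])
```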


    Term Entropy

w_ij = log(FREQ_ij + 1.0) * (1 + entropy_j)

where

entropy_j = (1 / log N) * Σ_i (FREQ_ij / DOCFREQ_j) * log(FREQ_ij / DOCFREQ_j)

is the average entropy of the jth term; it evaluates to:

-1: if every word occurs once in every document
 0: if each word occurs once in only one document
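A minimal sketch of the entropy weighting, following the slide's formula literally (ratios of FREQ to DOCFREQ, normalized by log N), so the two boundary cases above come out as -1 and 0:

```python
import math

def term_entropy(freq_column):
    """entropy_j = (1/log N) * sum_i (f_ij/df_j) * log(f_ij/df_j),
    computed over one term's frequency column (one entry per document).
    Evaluates to -1 if the term occurs once in every document, and to
    0 if it occurs in only a single document."""
    n = len(freq_column)
    df = sum(1 for f in freq_column if f > 0)
    total = 0.0
    for f in freq_column:
        if f > 0:
            p = f / df
            total += p * math.log(p)
    return total / math.log(n) if n > 1 else 0.0

def entropy_weight(freq, entropy):
    """w_ij = log(FREQ_ij + 1.0) * (1 + entropy_j), per the slide."""
    return math.log(freq + 1.0) * (1.0 + entropy)
```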


    Similarity Measures


    Similarity measures

- For various tasks, we need a measure of similarity between documents
- Euclidean distance is a common measure
- Cosine similarity or dot-product is another measure used in text mining:
  - Focuses on co-occurrence of words
  - Corresponds to the cosine of the angle between the two vectors

sim(v1, v2) = (v1 · v2) / (|v1| * |v2|)
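The cosine measure is a dot product over the product of vector lengths; a minimal sketch for plain Python lists:

```python
import math

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| * |v2|); 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Note that cosine similarity depends only on direction, not length, so a document and the same document repeated twice score 1.0, which is often what we want for text.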


    Text Mining Tasks


Text Categorization: Architecture

[Diagram:]
Training documents -> preprocessing -> weighting -> feature selection -> Classifier (trained against the predefined categories)
New document d -> Classifier -> category(ies) assigned to d
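The slide leaves the classifier box open. One minimal sketch is a nearest-centroid (Rocchio-style) classifier over weighted document vectors: average the training vectors per category, then assign a new document to the category whose centroid is most cosine-similar. The vectors below are hypothetical toy data, not from the slides:

```python
import math

def train_centroids(vectors, labels):
    """One mean vector per category (nearest-centroid / Rocchio sketch)."""
    groups = {}
    for v, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(v)
    return {lab: [sum(col) / len(vs) for col in zip(*vs)]
            for lab, vs in groups.items()}

def classify(centroids, v):
    """Assign the category whose centroid has the highest cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return max(centroids, key=lambda lab: cos(centroids[lab], v))

# Toy term vectors: database-flavoured vs. statistics-flavoured documents.
train = [[24, 21, 0], [32, 10, 0], [2, 0, 18], [1, 0, 34]]
labels = ["db", "db", "stats", "stats"]
centroids = train_centroids(train, labels)
```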


    Document Clustering

Task: group all documents so that documents in the same group are more similar to each other than to documents in other groups.

Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.


    Document Clustering Algorithms

- k-means
- Hierarchical Agglomerative Clustering (HAC)
- Association Rule Hypergraph Partitioning (ARHP)
- SOM / ESOM based clustering
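Of the algorithms listed, k-means is the simplest to sketch: alternately assign each document vector to its nearest centroid and recompute each centroid as the mean of its cluster. A minimal pure-Python version with Euclidean distance, random initial centroids, and a fixed iteration count (production code would test for convergence and restart from several seeds):

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means sketch; returns a cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(v, centroids[c])))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors))
                       if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign

# Toy term vectors: two database-like and two statistics-like documents.
docs = [[24, 21, 0], [32, 10, 0], [2, 0, 18], [1, 0, 34]]
clusters = kmeans(docs, 2)
```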


    Latent Semantic Indexing

- Weakness of keyword-based techniques
  - Lack of semantics
  - Cannot identify similar words/concepts without help
- Observation
  - Words/phrases that represent similar concepts are usually grouped together
  - The most important unit of information in documents may not be the word, but the concept


    Latent Semantic Indexing

- Latent Semantic Indexing is an attempt to produce such information
- Start with the term-frequency matrix M
- Apply singular value decomposition to M
  - M = U * S * V^T
  - S = a diagonal matrix of singular values
  - U relates the documents to the latent concepts
  - V relates the terms to the latent concepts
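A minimal LSI sketch via truncated SVD, assuming numpy is available and M is a document-by-term frequency matrix: keep only the k largest singular values, giving k-dimensional "concept" representations of both documents and terms. The example matrix is a hypothetical subset in the spirit of the earlier document-term table:

```python
import numpy as np

def lsi(doc_term_matrix, k):
    """Rank-k LSI: M ≈ U_k * S_k * V_k^T. Returns document vectors
    (rows of U_k scaled by the singular values) and term vectors
    (rows of V_k scaled likewise) in the k-dimensional concept space."""
    M = np.asarray(doc_term_matrix, dtype=float)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    doc_vecs = U[:, :k] * s[:k]      # one row per document
    term_vecs = Vt[:k, :].T * s[:k]  # one row per term
    return doc_vecs, term_vecs

# Two database-like and two statistics-like documents, three terms.
M = [[24, 21, 0], [32, 10, 0], [2, 0, 18], [1, 0, 34]]
doc_vecs, term_vecs = lsi(M, 2)
```

In the 2-dimensional concept space the two database-like documents end up close together and far from the statistics-like ones, which is exactly the similarity structure keyword matching alone misses.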