Data Mining Lectures Lecture 12: Text Mining Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic SmythDepartment of Information and Computer Science
University of California, Irvine
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results into categories in real time
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting resume information automatically from resumes
    • extracting protein-interaction information from biology papers
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
General concepts in Information Retrieval
• Representation language
  – typically a vector of p attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  – typically represented as an N x p matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • the query Q is also expressed as a set of words, e.g., "data" and "mining"
Query by Content
• traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] & [age < 30]
• query by content: more general / less precise
  – e.g., Q = which historical record is most similar to this new one?
  – for text data, often called "information retrieval" (IR)
• Goal
  – match query Q to the N objects in the database
  – return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems
The Standard Approach
• fixed-length (d-dimensional) vector representation
  – for query (d-by-1 Q) and database (d-by-n X) objects
• use domain-specific higher-level features (vs. raw data)
  – images: color (e.g., RGB), texture (e.g., Gabor filters, Fourier coefficients), …
  – text: "bag of words", i.e., a frequency count for each word in each document, …
• compute distances between the vectorized representations
• use k-NN to find the k vectors in X closest to Q
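The pipeline above can be sketched in a few lines; a minimal illustration in Python (the toy database vectors and query values are invented for this example, not taken from the slides):

```python
import math

# Toy database X: each row is a d-dimensional feature vector for one object
# (for text, these would be bag-of-words counts). Values are illustrative only.
X = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
]

def euclidean(x, y):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_retrieve(Q, X, k):
    """Return indices of the k database vectors closest to query Q."""
    ranked = sorted(range(len(X)), key=lambda i: euclidean(Q, X[i]))
    return ranked[:k]

Q = [20, 15, 5, 0, 0, 0]        # query in the same representation as the data
print(knn_retrieve(Q, X, k=2))  # → [0, 1]: indices of the 2 nearest objects
```

In practice the distance function would be swapped for cosine distance on weighted term vectors, as discussed later in the lecture.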
Evaluating Retrieval Methods
• for predictive models (classification/regression) the objective is straightforward
  – score = accuracy on unseen test data
• evaluation is more complex for query by content
  – the real score is how "useful" the retrieved information is, which is subjective
    • e.g., how would you define the real score for Google's top 10 hits?
• toward objectivity, assume:
  – 1) each object is either "relevant" or "irrelevant"
    • simplification: binary, and the same for all users (e.g., decided by committee vote)
  – 2) each object is labelled by an objective, consistent oracle
  – these assumptions suggest a classifier approach is possible
    • but the goals are rather different: we want the objects nearest to Q, not separability per se
    • and it would require learning a classifier at query time (with Q as the positive class)
  – which is why a k-NN-type approach seems so appropriate …
Precision versus Recall
• DQ = Q’s ranked retrievals (smallest distance first)
• DQ,T = those retrievals with distance < threshold T
  – small threshold (~0): few false positives (FP: retrieved but not relevant), many false negatives (FN: relevant but not retrieved)
  – large threshold: few false negatives, many false positives
• precision = TP / (TP + FP)
  – fraction of retrieved objects that are relevant
• recall = TP / (TP + FN)
  – fraction of the relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• For multiple queries, precision at specific recall levels can be averaged (so-called "interpolated precision").
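The definitions above translate directly into code; a small sketch using sets of object IDs (the example IDs are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall given sets of retrieved and relevant IDs."""
    tp = len(retrieved & relevant)   # relevant items we retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: suppose a threshold on distance returns objects {1, 2, 3, 4}
retrieved = {1, 2, 3, 4}   # hypothetical IDs returned for a query
relevant = {1, 2, 5}       # hypothetical ground-truth relevant IDs
p, r = precision_recall(retrieved, relevant)
print(p, r)                # precision 0.5, recall 2/3
```

Lowering the threshold shrinks the retrieved set, which tends to raise precision and lower recall, reproducing the tradeoff described above.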
Precision-Recall Curve (a form of ROC curve)

[Figure: precision-recall curves for three retrieval systems A, B, and C; C is uniformly worse than A and B. Alternative single-number summaries: precision at the point where recall = precision; precision for a fixed number of retrievals; or average precision over multiple recall levels.]
TREC evaluations
• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q
  – Each competing system provides its ranked list of documents for each query
  – The union of the top 100 or so documents from each system is then manually judged relevant or non-relevant for each query Q
  – Precision, recall, etc., are then calculated and the systems compared
Text Retrieval
• document: book, paper, Web page, …
• term: word, word pair, phrase, … (often 50,000+ terms)
• query Q = set of terms, e.g., "data" + "mining"
• full NLP (natural language processing) is too hard, so …
• we want a (vector) representation for text that
  – retains maximum useful semantics
  – supports efficient distance computations between documents and Q
• term weights
  – Boolean (e.g., whether the term occurs in the document or not): "bag of words"
  – real-valued (e.g., frequency of the term in the document, possibly relative to all documents), …
• notice: this representation loses word order, sentence structure, etc.
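A bag-of-words representation can be built directly with the Python standard library; a minimal sketch (the two example sentences are invented):

```python
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining applies data mining to text",
]

# Build the vocabulary: the fixed term ordering shared by all vectors
vocab = sorted({w for doc in docs for w in doc.lower().split()})

def bag_of_words(doc, vocab):
    """Frequency count for each vocabulary term; word order is discarded."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

X = [bag_of_words(d, vocab) for d in docs]
print(vocab)   # ['applies', 'data', 'finds', 'in', 'mining', 'patterns', 'text', 'to']
print(X)       # [[0, 2, 1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 2, 0, 2, 1]]
```

Note that "data mining finds patterns" and "patterns finds mining data" would map to identical vectors, illustrating the loss of word order.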
Toy example of a document-term matrix
      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
Distances between Documents
• Measuring distance between two documents: a wide variety of distance metrics can be used:
  • Euclidean (L2): d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
  • L1: d(x, y) = Σ_i |x_i − y_i|
  • weighted L2: d(x, y) = sqrt( Σ_i (w_i x_i − w_i y_i)^2 )
  • …
• Cosine distance between documents D_i = (d_i1, …, d_iT) and D_j:
  – d_c(D_i, D_j) = Σ_{k=1..T} d_ik d_jk / sqrt( (Σ_{k=1..T} d_ik^2) (Σ_{k=1..T} d_jk^2) )
  – can give better results than Euclidean distance because it normalizes relative to document length
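The cosine formula can be checked on rows d1 and d2 of the toy matrix; here it is computed as a similarity (the corresponding distance is one minus this value):

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y, normalized by the lengths of both vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

# Term-frequency vectors d1 and d2 from the toy document-term matrix
d1 = [24, 21, 9, 0, 0, 3]
d2 = [32, 10, 5, 0, 3, 0]

print(round(cosine_similarity(d1, d2), 2))   # → 0.9
```

Because the dot product is divided by both vector norms, doubling every count in d2 (a longer document on the same topics) leaves the similarity unchanged, which is exactly the length normalization the slide describes.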
Distance matrices for the toy document-term data

TF doc-term matrix: the same 10 x 6 toy matrix shown above.

[Figure: the 10 x 10 matrix of pairwise Euclidean distances and the 10 x 10 matrix of pairwise cosine distances between documents d1 … d10]
TF-IDF Term Weighting Schemes
• binary weights favor larger documents, so …
• TF (term frequency): term weight = number of times the term occurs in that document
  – problem: a term common to many documents gives low discrimination
• IDF (inverse document frequency of a term)
  – n_j documents contain term j, out of N documents in total
  – IDF_j = log(N / n_j)
  – favors terms that occur in relatively few documents
• TF-IDF weight = TF(term) * IDF(term)
• no real theoretical basis, but works well empirically and is widely used
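TF-IDF weighting for the toy matrix can be computed in a few lines; natural log is assumed here, which reproduces the IDF weights quoted on the next slide:

```python
import math

# Toy TF document-term matrix (10 documents, 6 terms), from the slides
tf = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 2],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]
N = len(tf)

# IDF_j = log(N / n_j), where n_j = number of documents containing term j
n = [sum(1 for doc in tf if doc[j] > 0) for j in range(6)]
idf = [math.log(N / nj) for nj in n]

# TF-IDF weight = TF * IDF, applied entry by entry
tfidf = [[doc[j] * idf[j] for j in range(6)] for doc in tf]

print([round(w, 1) for w in idf])   # → [0.1, 0.7, 0.5, 0.7, 0.4, 0.7]
print(round(tfidf[0][0], 1))        # 24 * log(10/9) → 2.5
```

Term t1 occurs in 9 of the 10 documents, so its IDF is tiny (log(10/9) ≈ 0.1) and its large raw counts are sharply down-weighted, while the rarer terms keep most of their weight.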
TF-IDF Example
TF doc-term matrix: the same 10 x 6 toy matrix shown above.

TF-IDF doc-term matrix (first five documents):

      t1     t2     t3    t4    t5    t6
d1   2.5   14.6    4.6     0     0   2.1
d2   3.4    6.9    2.6     0   1.1     0
d3   1.3   11.1    2.6     0     0     0
d4   0.6    4.9    1.0     0     0     0
d5   4.5   21.5   10.2     0   1.1     0
…

Example: TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) ≈ 2.5.
The IDF weights for t1 … t6 are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7).
Typical Document Querying System
• Queries Q = binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking

TF doc-term matrix: the same 10 x 6 toy matrix shown above.

Example query: Q = (1, 0, 1, 0, 0, 0)

Cosine similarity of Q to each document under TF weights versus TF-IDF weights:

       TF    TF-IDF
d1   0.70     0.32
d2   0.77     0.51
d3   0.58     0.24
d4   0.60     0.23
d5   0.79     0.43
…
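The query step on this slide can be reproduced directly: cosine similarity between the binary query Q = (1, 0, 1, 0, 0, 0) and the raw TF vectors recovers the TF column of the table above.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# First five rows of the toy TF doc-term matrix
tf = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
]
Q = [1, 0, 1, 0, 0, 0]   # binary query: terms t1 and t3

sims = [round(cosine(Q, doc), 2) for doc in tf]
print(sims)   # → [0.7, 0.77, 0.58, 0.6, 0.79], matching the TF column
```

Repeating the computation with the TF-IDF-weighted rows (as in the previous slide) would yield the second column; the ranking changes because common terms are down-weighted.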
Synonymy and Polysemy
• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall
• Polysemy
  – identical terms can be used in very different semantic contexts
    • e.g., "bank": a repository where important material is saved, or the slope beside a body of water
  – negatively affects precision
Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  – principal components analysis or SVD
  – known as "latent semantic indexing" (LSI) when applied to text
• Captures dependencies among terms
  – in effect replaces the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, and query could be approximated as coming from a single "hidden" term
• Why is this useful?
  – suppose the query contains "automobile" and a document contains "vehicle"
  – we can still match Q to the document, since the two terms will be close in the k-dimensional space (though not in the original space); i.e., this addresses the synonymy problem
Toy example of a document-term matrix
      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
SVD
• M = U S V^T
  – M: the n x d original document-term matrix (the data)
  – U: n x d; each row is the vector of weights for one document
  – S: d x d diagonal matrix of singular values
  – the columns of V (rows of V^T) form a new orthogonal basis for the data
  – each singular value indicates how much of the data's variation is captured by the corresponding basis vector
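The decomposition can be sketched with NumPy. Note that `numpy.linalg.svd` returns singular values in decreasing order, and that the signs of the singular vectors are arbitrary, so the recovered document coordinates may differ in sign from the table on the next slide:

```python
import numpy as np

# Toy document-term matrix M (n = 10 documents, d = 6 terms), from the slides
M = np.array([
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 2],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
], dtype=float)

# M = U S V^T; full_matrices=False gives U as n x d and Vt as d x d
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Each document's coordinates in the new basis are the rows of U * S;
# the first two columns correspond to the (U1, U2) values on the next slide
coords = U * s                 # broadcasting scales column j of U by s[j]
print(coords[:, :2].round(1))

# Sanity check: the factorization reconstructs M exactly
assert np.allclose(U @ np.diag(s) @ Vt, M)
```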
Example of SVD
Document coordinates in the first two dimensions of the new basis:

       U1      U2
d1   30.9   -11.5
d2   30.3   -10.8
d3   18.0    -7.7
d4    8.4    -3.6
d5   52.7   -20.6
d6   14.2    21.8
d7   10.8    21.9
d8   11.5    28.0
d9    9.5    17.8
d10  19.9    45.0

Original doc-term matrix:

      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
The first two basis vectors:
v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

Illustrative documents: D1 = "database" x 50, D2 = "SQL" x 50
Another LSI Example
• A collection of documents:
d1: Indian government goes for open-source software
d2: Debian 3.0 Woody released
d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0
d4: gnuPOD released: iPOD on Linux… with GPLed software
d5: Gentoo servers running at open-source mySQL database
d6: Dolly the sheep not totally identical clone
d7: DNA news: introduced low-cost human genome DNA chip
d8: Malaria-parasite genome database on the Web
d9: UK sets up genome bank to protect rare sheep breeds
d10: Dolly’s DNA damaged
LSI Example (continued)
• The term-document matrix X

              d1  d2  d3  d4  d5  d6  d7  d8  d9  d10
open-source    1   0   0   0   1   0   0   0   0   0
software       1   0   0   1   0   0   0   0   0   0
Linux          0   0   0   1   0   0   0   0   0   0
released       0   1   1   1   0   0   0   0   0   0
Debian         0   1   1   0   0   0   0   0   0   0
Gentoo         0   0   1   0   1   0   0   0   0   0
database       0   0   0   0   1   0   0   1   0   0
Dolly          0   0   0   0   0   1   0   0   0   1
sheep          0   0   0   0   0   1   0   0   0   0
genome         0   0   0   0   0   0   1   1   1   0
DNA            0   0   0   0   0   0   2   0   0   1
LSI Example (continued)

• The reconstructed term-document matrix X̂ after projecting onto a subspace of dimension K = 2
• The full set of singular values is S = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)

               d1     d2     d3     d4     d5     d6     d7     d8     d9    d10
open-source  0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
software     0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
Linux        0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
released     0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
Debian       0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
Gentoo       0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
database     0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
Dolly       -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
sheep       -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
genome       0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
DNA         -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
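The K = 2 reconstruction can be reproduced with NumPy, assuming the 0/1 term-document matrix on the previous slide is transcribed correctly:

```python
import numpy as np

# Term-document matrix X from the LSI example (11 terms x 10 documents)
X = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # open-source
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # software
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # Linux
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],   # released
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # Debian
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],   # Gentoo
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],   # database
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],   # Dolly
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],   # sheep
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],   # genome
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 1],   # DNA
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top K = 2 singular values/vectors and reconstruct
K = 2
X2 = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
print(X2.round(2))   # should approximate the reconstructed matrix on this slide
```

Even though the original matrix is binary, the rank-2 reconstruction has non-zero entries in cells that were zero (e.g., "genome" for d10), which is exactly how LSI surfaces latent term-document associations.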
Further Reading
• Text: Chapter 14
• Web-related document search
  – An excellent resource is Chapter 3, "Web Search and Information Retrieval," in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
  – Information on how real Web search engines work: http://searchenginewatch.com/
• Latent semantic analysis
  – Applied to grading of essays: "The debate on automated grading," IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf
Next up ….
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction