Data Mining Lectures Lecture 12: Text Mining Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 12: Text Mining
Padhraic SmythDepartment of Information and Computer Science
University of California, Irvine
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
Text Mining Applications
• Information Retrieval
  – Query-based search of large text archives, e.g., the Web
• Text Classification
  – Automated assignment of topics to Web pages, e.g., Yahoo, Google
  – Automated classification of email into spam and non-spam
• Text Clustering
  – Automated organization of search results into categories in real time
  – Discovery of clusters and trends in technical literature (e.g., CiteSeer)
• Information Extraction
  – Extracting standard fields from free text
    • extracting names and places from reports and newspapers (e.g., military applications)
    • extracting resume information automatically from resumes
    • extracting protein-interaction information from biology papers
Text Mining
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction
General concepts in Information Retrieval
• Representation language
  – typically a vector of p attribute values, e.g.,
    • a set of color, intensity, and texture features characterizing images
    • word counts for text documents
• Data set D of N objects
  – typically represented as an N x p matrix
• Query Q
  – the user poses a query to search D
  – the query is typically expressed in the same representation language as the data, e.g.,
    • each text document is represented by the set of words that occur in it
    • the query Q is also expressed as a set of words, e.g., "data" and "mining"
Query by Content
• traditional DB query: exact matches
  – e.g., query Q = [level = MANAGER] & [age < 30]
• query by content: more general / less precise
  – e.g., Q = which historical record is most similar to this new one?
  – for text data, often called "information retrieval" (IR)
• Goal
  – match query Q to the N objects in the database
  – return a ranked list (typically) of the most similar/relevant objects in the data set D given Q
Issues in Query by Content
– What representation language to use
– How to measure similarity between Q and each object in D
– How to compute the results in real time (for interactive querying)
– How to rank the results for the user
– How to allow user feedback (query modification)
– How to evaluate and compare different IR algorithms/systems
The Standard Approach
• fixed-length (d-dimensional) vector representation
  – for query (d-by-1 Q) and database (d-by-n X) objects
• use domain-specific higher-level features (vs. raw data)
  – images: color (e.g., RGB), texture (e.g., Gabor filters, Fourier coefficients), …
  – text: "bag of words", i.e., a frequency count for each word in each document, …
• compute distances between the vectorized representations
• use k-NN to find the k vectors in X closest to Q
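The pipeline above can be sketched in a few lines; a minimal illustration in Python (the toy database vectors and query values are invented for this example, not taken from the slides):

```python
import math

# Toy database X: each row is a d-dimensional feature vector for one object
# (for text, these would be bag-of-words counts). Values are illustrative only.
X = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
]

def euclidean(x, y):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_retrieve(Q, X, k):
    """Return indices of the k database vectors closest to query Q."""
    ranked = sorted(range(len(X)), key=lambda i: euclidean(Q, X[i]))
    return ranked[:k]

Q = [20, 15, 5, 0, 0, 0]        # query in the same representation as the data
print(knn_retrieve(Q, X, k=2))  # → [0, 1]: indices of the 2 nearest objects
```

In practice the distance function would be swapped for cosine distance on weighted term vectors, as discussed later in the lecture.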
Evaluating Retrieval Methods
• for predictive models (classification/regression) the objective is straightforward
  – score = accuracy on unseen test data
• evaluation is more complex for query by content
  – the real score is how "useful" the retrieved information is, which is subjective
    • e.g., how would you define the real score for Google's top 10 hits?
• toward objectivity, assume:
  – 1) each object is either "relevant" or "irrelevant"
    • simplification: binary, and the same for all users (e.g., decided by committee vote)
  – 2) each object is labelled by an objective, consistent oracle
  – these assumptions suggest a classifier approach is possible
    • but the goals are rather different: we want the objects nearest to Q, not separability per se
    • and it would require learning a classifier at query time (with Q as the positive class)
  – which is why a k-NN-type approach seems so appropriate …
Precision versus Recall
• DQ = Q’s ranked retrievals (smallest distance first)
• DQ,T = those retrievals with distance < threshold T
  – small threshold (~0): few false positives (FP: retrieved but not relevant), many false negatives (FN: relevant but not retrieved)
  – large threshold: few false negatives, many false positives
• precision = TP / (TP + FP)
  – fraction of retrieved objects that are relevant
• recall = TP / (TP + FN)
  – fraction of the relevant objects that are retrieved
• Tradeoff: high precision -> low recall, and vice versa
• For multiple queries, precision at specific recall levels can be averaged (so-called "interpolated precision").
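The definitions above translate directly into code; a small sketch using sets of object IDs (the example IDs are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall given sets of retrieved and relevant IDs."""
    tp = len(retrieved & relevant)   # relevant items we retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: suppose a threshold on distance returns objects {1, 2, 3, 4}
retrieved = {1, 2, 3, 4}   # hypothetical IDs returned for a query
relevant = {1, 2, 5}       # hypothetical ground-truth relevant IDs
p, r = precision_recall(retrieved, relevant)
print(p, r)                # precision 0.5, recall 2/3
```

Lowering the threshold shrinks the retrieved set, which tends to raise precision and lower recall, reproducing the tradeoff described above.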
Precision-Recall Curve (a form of ROC curve)

[Figure: precision-recall curves for three retrieval systems A, B, and C; C is uniformly worse than A and B. Alternative single-number summaries: precision at the point where recall = precision; precision for a fixed number of retrievals; or average precision over multiple recall levels.]
TREC evaluations
• Text REtrieval Conference (TREC)
  – Web site: trec.nist.gov
• Annual impartial evaluation of IR systems
  – e.g., D = 1 million documents
  – TREC organizers supply contestants with several hundred queries Q
  – Each competing system provides its ranked list of documents for each query
  – The union of the top 100 or so documents from each system is then manually judged relevant or non-relevant for each query Q
  – Precision, recall, etc., are then calculated and the systems compared
Text Retrieval
• document: book, paper, Web page, …
• term: word, word pair, phrase, … (often 50,000+ terms)
• query Q = set of terms, e.g., "data" + "mining"
• full NLP (natural language processing) is too hard, so …
• we want a (vector) representation for text that
  – retains maximum useful semantics
  – supports efficient distance computations between documents and Q
• term weights
  – Boolean (e.g., whether the term occurs in the document or not): "bag of words"
  – real-valued (e.g., frequency of the term in the document, possibly relative to all documents), …
• notice: this representation loses word order, sentence structure, etc.
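A bag-of-words representation can be built directly with the Python standard library; a minimal sketch (the two example sentences are invented):

```python
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "text mining applies data mining to text",
]

# Build the vocabulary: the fixed term ordering shared by all vectors
vocab = sorted({w for doc in docs for w in doc.lower().split()})

def bag_of_words(doc, vocab):
    """Frequency count for each vocabulary term; word order is discarded."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

X = [bag_of_words(d, vocab) for d in docs]
print(vocab)   # ['applies', 'data', 'finds', 'in', 'mining', 'patterns', 'text', 'to']
print(X)       # [[0, 2, 1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 2, 0, 2, 1]]
```

Note that "data mining finds patterns" and "patterns finds mining data" would map to identical vectors, illustrating the loss of word order.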
Toy example of a document-term matrix
      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
Distances between Documents
• Measuring distance between two documents: a wide variety of distance metrics can be used:
  • Euclidean (L2): d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
  • L1: d(x, y) = Σ_i |x_i − y_i|
  • weighted L2: d(x, y) = sqrt( Σ_i (w_i x_i − w_i y_i)^2 )
  • …
• Cosine distance between documents D_i = (d_i1, …, d_iT) and D_j:
  – d_c(D_i, D_j) = Σ_{k=1..T} d_ik d_jk / sqrt( (Σ_{k=1..T} d_ik^2) (Σ_{k=1..T} d_jk^2) )
  – can give better results than Euclidean distance because it normalizes relative to document length
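The cosine formula can be checked on rows d1 and d2 of the toy matrix; here it is computed as a similarity (the corresponding distance is one minus this value):

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y, normalized by the lengths of both vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

# Term-frequency vectors d1 and d2 from the toy document-term matrix
d1 = [24, 21, 9, 0, 0, 3]
d2 = [32, 10, 5, 0, 3, 0]

print(round(cosine_similarity(d1, d2), 2))   # → 0.9
```

Because the dot product is divided by both vector norms, doubling every count in d2 (a longer document on the same topics) leaves the similarity unchanged, which is exactly the length normalization the slide describes.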
Distance matrices for the toy document-term data

TF doc-term matrix: the same 10 x 6 toy matrix shown above.

[Figure: the 10 x 10 matrix of pairwise Euclidean distances and the 10 x 10 matrix of pairwise cosine distances between documents d1 … d10]
TF-IDF Term Weighting Schemes
• binary weights favor larger documents, so …
• TF (term frequency): term weight = number of times the term occurs in that document
  – problem: a term common to many documents gives low discrimination
• IDF (inverse document frequency of a term)
  – n_j documents contain term j, out of N documents in total
  – IDF_j = log(N / n_j)
  – favors terms that occur in relatively few documents
• TF-IDF weight = TF(term) * IDF(term)
• no real theoretical basis, but works well empirically and is widely used
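TF-IDF weighting for the toy matrix can be computed in a few lines; natural log is assumed here, which reproduces the IDF weights quoted on the next slide:

```python
import math

# Toy TF document-term matrix (10 documents, 6 terms), from the slides
tf = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 2],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]
N = len(tf)

# IDF_j = log(N / n_j), where n_j = number of documents containing term j
n = [sum(1 for doc in tf if doc[j] > 0) for j in range(6)]
idf = [math.log(N / nj) for nj in n]

# TF-IDF weight = TF * IDF, applied entry by entry
tfidf = [[doc[j] * idf[j] for j in range(6)] for doc in tf]

print([round(w, 1) for w in idf])   # → [0.1, 0.7, 0.5, 0.7, 0.4, 0.7]
print(round(tfidf[0][0], 1))        # 24 * log(10/9) → 2.5
```

Term t1 occurs in 9 of the 10 documents, so its IDF is tiny (log(10/9) ≈ 0.1) and its large raw counts are sharply down-weighted, while the rarer terms keep most of their weight.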
TF-IDF Example
TF doc-term matrix: the same 10 x 6 toy matrix shown above.

TF-IDF doc-term matrix (first five documents):

      t1     t2     t3    t4    t5    t6
d1   2.5   14.6    4.6     0     0   2.1
d2   3.4    6.9    2.6     0   1.1     0
d3   1.3   11.1    2.6     0     0     0
d4   0.6    4.9    1.0     0     0     0
d5   4.5   21.5   10.2     0   1.1     0
…

Example: TF-IDF(t1 in d1) = TF * IDF = 24 * log(10/9) ≈ 2.5.
The IDF weights for t1 … t6 are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7).
Typical Document Querying System
• Queries Q = binary term vectors
• Documents represented by TF-IDF weights
• Cosine distance used for retrieval and ranking

TF doc-term matrix: the same 10 x 6 toy matrix shown above.

Example query: Q = (1, 0, 1, 0, 0, 0)

Cosine similarity of Q to each document under TF weights versus TF-IDF weights:

       TF    TF-IDF
d1   0.70     0.32
d2   0.77     0.51
d3   0.58     0.24
d4   0.60     0.23
d5   0.79     0.43
…
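The query step on this slide can be reproduced directly: cosine similarity between the binary query Q = (1, 0, 1, 0, 0, 0) and the raw TF vectors recovers the TF column of the table above.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# First five rows of the toy TF doc-term matrix
tf = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
]
Q = [1, 0, 1, 0, 0, 0]   # binary query: terms t1 and t3

sims = [round(cosine(Q, doc), 2) for doc in tf]
print(sims)   # → [0.7, 0.77, 0.58, 0.6, 0.79], matching the TF column
```

Repeating the computation with the TF-IDF-weighted rows (as in the previous slide) would yield the second column; the ranking changes because common terms are down-weighted.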
Synonymy and Polysemy
• Synonymy
  – the same concept can be expressed using different sets of terms
    • e.g., bandit, brigand, thief
  – negatively affects recall
• Polysemy
  – identical terms can be used in very different semantic contexts
    • e.g., "bank": a repository where important material is saved, or the slope beside a body of water
  – negatively affects precision
Latent Semantic Indexing

• Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d
• Find the k linear projections of the data that contain the most variance
  – principal components analysis or SVD
  – known as "latent semantic indexing" (LSI) when applied to text
• Captures dependencies among terms
  – in effect replaces the original d-dimensional basis with a k-dimensional basis
  – e.g., terms like SQL, indexing, and query could be approximated as coming from a single "hidden" term
• Why is this useful?
  – suppose the query contains "automobile" and a document contains "vehicle"
  – we can still match Q to the document, since the two terms will be close in the k-dimensional space (though not in the original space); i.e., this addresses the synonymy problem
Toy example of a document-term matrix
      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
SVD
• M = U S V^T
  – M: the n x d original document-term matrix (the data)
  – U: n x d; each row is the vector of weights for one document
  – S: d x d diagonal matrix of singular values
  – the columns of V (rows of V^T) form a new orthogonal basis for the data
  – each singular value indicates how much of the data's variation is captured by the corresponding basis vector
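The decomposition can be sketched with NumPy. Note that `numpy.linalg.svd` returns singular values in decreasing order, and that the signs of the singular vectors are arbitrary, so the recovered document coordinates may differ in sign from the table on the next slide:

```python
import numpy as np

# Toy document-term matrix M (n = 10 documents, d = 6 terms), from the slides
M = np.array([
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 2],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
], dtype=float)

# M = U S V^T; full_matrices=False gives U as n x d and Vt as d x d
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Each document's coordinates in the new basis are the rows of U * S;
# the first two columns correspond to the (U1, U2) values on the next slide
coords = U * s                 # broadcasting scales column j of U by s[j]
print(coords[:, :2].round(1))

# Sanity check: the factorization reconstructs M exactly
assert np.allclose(U @ np.diag(s) @ Vt, M)
```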
Example of SVD
Document coordinates in the first two dimensions of the new basis:

       U1      U2
d1   30.9   -11.5
d2   30.3   -10.8
d3   18.0    -7.7
d4    8.4    -3.6
d5   52.7   -20.6
d6   14.2    21.8
d7   10.8    21.9
d8   11.5    28.0
d9    9.5    17.8
d10  19.9    45.0

Original doc-term matrix:

      database   SQL   index   regression   likelihood   linear
d1        24     21      9          0            0          3
d2        32     10      5          0            3          0
d3        12     16      5          0            0          0
d4         6      7      2          0            0          0
d5        43     31     20          0            3          0
d6         2      0      0         18            7         16
d7         0      0      1         32           12          0
d8         3      0      0         22            4          2
d9         1      0      0         34           27         25
d10        6      0      0         17            4         23
The first two basis vectors:
v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19]
v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31]

Illustrative documents: D1 = "database" x 50, D2 = "SQL" x 50
Another LSI Example
• A collection of documents:
d1: Indian government goes for open-source software
d2: Debian 3.0 Woody released
d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0
d4: gnuPOD released: iPOD on Linux… with GPLed software
d5: Gentoo servers running at open-source mySQL database
d6: Dolly the sheep not totally identical clone
d7: DNA news: introduced low-cost human genome DNA chip
d8: Malaria-parasite genome database on the Web
d9: UK sets up genome bank to protect rare sheep breeds
d10: Dolly’s DNA damaged
LSI Example (continued)
• The term-document matrix X

              d1  d2  d3  d4  d5  d6  d7  d8  d9  d10
open-source    1   0   0   0   1   0   0   0   0   0
software       1   0   0   1   0   0   0   0   0   0
Linux          0   0   0   1   0   0   0   0   0   0
released       0   1   1   1   0   0   0   0   0   0
Debian         0   1   1   0   0   0   0   0   0   0
Gentoo         0   0   1   0   1   0   0   0   0   0
database       0   0   0   0   1   0   0   1   0   0
Dolly          0   0   0   0   0   1   0   0   0   1
sheep          0   0   0   0   0   1   0   0   0   0
genome         0   0   0   0   0   0   1   1   1   0
DNA            0   0   0   0   0   0   2   0   0   1
LSI Example (continued)

• The reconstructed term-document matrix X̂ after projecting onto a subspace of dimension K = 2
• The full set of singular values is S = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)

               d1     d2     d3     d4     d5     d6     d7     d8     d9    d10
open-source  0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
software     0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
Linux        0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
released     0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
Debian       0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
Gentoo       0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
database     0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
Dolly       -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
sheep       -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
genome       0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
DNA         -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
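The K = 2 reconstruction can be reproduced with NumPy, assuming the 0/1 term-document matrix on the previous slide is transcribed correctly:

```python
import numpy as np

# Term-document matrix X from the LSI example (11 terms x 10 documents)
X = np.array([
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],   # open-source
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # software
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],   # Linux
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],   # released
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],   # Debian
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],   # Gentoo
    [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],   # database
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],   # Dolly
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],   # sheep
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 0],   # genome
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 1],   # DNA
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top K = 2 singular values/vectors and reconstruct
K = 2
X2 = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
print(X2.round(2))   # should approximate the reconstructed matrix on this slide
```

Even though the original matrix is binary, the rank-2 reconstruction has non-zero entries in cells that were zero (e.g., "genome" for d10), which is exactly how LSI surfaces latent term-document associations.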
Further Reading
• Text: Chapter 14
• Web-related document search
  – An excellent resource is Chapter 3, "Web Search and Information Retrieval," in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
  – Information on how real Web search engines work: http://searchenginewatch.com/
• Latent semantic analysis
  – Applied to grading of essays: "The debate on automated grading," IEEE Intelligent Systems, September/October 2000. Online at http://www.k-a-t.com/papers/IEEEdebate.pdf
Next up ….
• Information Retrieval
• Text Classification
• Text Clustering
• Information Extraction