text mining - university of iowa


Page 1: Text Mining - University of Iowa


Text Mining

Joseph Engler

What is Text Mining

Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.

-Marti Hearst, UC Berkeley


Document Gathering

• Text Databases
− United States Patent and Trademark Office
− Pacific Union College Nelson Memorial Library

• Document Repositories
− e-Law Document Repository
− FAO Corporate Document Repository

• World Wide Web

Basic Measures for Text Retrieval

• Precision: the percentage of retrieved documents that are in fact relevant to the query.

• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

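$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$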
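The two measures can be checked on a small example. The sketch below assumes documents are identified by ids held in sets; the function name `precision_recall` and the ids are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # {Relevant} ∩ {Retrieved}
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, so the two measures differ even though the intersection is the same.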


Text Retrieval Methods

• Boolean Retrieval Model
− Document is represented by a set of key words
− User provides a Boolean expression of key words
  • “innovation” AND “process”
  • “text mining” BUT NOT “boring”

• Document Ranking
− Uses the query to rank all documents in order of relevance
− Google’s PageRank algorithm is an extension of this model
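A minimal sketch of the Boolean retrieval model, assuming each document is stored as a Python set of key words. The corpus and the helper names `boolean_and` and `boolean_but_not` are hypothetical; the two queries are the ones given as examples.

```python
# Hypothetical mini-corpus: each document is represented by a set of key words.
docs = {
    "doc1": {"innovation", "process", "patent"},
    "doc2": {"text mining", "boring"},
    "doc3": {"text mining", "innovation"},
}


def boolean_and(docs, a, b):
    """Documents whose key word set contains both a and b."""
    return sorted(name for name, words in docs.items() if a in words and b in words)


def boolean_but_not(docs, a, b):
    """Documents whose key word set contains a but not b."""
    return sorted(name for name, words in docs.items() if a in words and b not in words)


and_hits = boolean_and(docs, "innovation", "process")
not_hits = boolean_but_not(docs, "text mining", "boring")
print(and_hits, not_hits)
```

A document either matches the expression or it does not; no ordering among the matches is produced, which is the gap that document ranking addresses.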

Tokenization of Text

• Preprocessing step in text mining
− Stop Word Removal
  • “the”, “and”, “for”
− Word Stemming
  • Innovation, Innovate, Innovative
  • “innova”
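The two preprocessing steps can be sketched as follows. The stop list holds only the three example words, and `toy_stem` with its suffix list is a hypothetical stand-in chosen so that the three example words reduce to “innova”; production stemmers use much richer rule sets.

```python
import re

STOP_WORDS = {"the", "and", "for"}     # the three example stop words
TOY_SUFFIXES = ("tion", "tive", "te")  # hypothetical, chosen only for this example


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Strip the first matching toy suffix, leaving at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [toy_stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


tokens = preprocess("The innovation and the innovative drive to innovate")
print(tokens)
```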


Model the Document for IR

• Term Frequency Matrix
− Measures the count of term i in document k

• Inverse Document Frequency
− Represents the importance of a term t.
− If a term t is frequent in many documents, its importance is scaled down.
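A small sketch of both quantities on a made-up corpus. The IDF formula is not given here, so the code assumes the common variant idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; other variants add smoothing terms.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (already stop-word filtered and stemmed).
docs = [
    ["text", "mining", "text"],
    ["mining", "patent"],
    ["text", "patent", "patent"],
    ["cluster", "text"],
]

terms = sorted({t for d in docs for t in d})

# Term frequency matrix: tf[i][k] is the count of term i in document k.
counts = [Counter(d) for d in docs]
tf = [[c[t] for c in counts] for t in terms]

# Inverse document frequency, assuming idf(t) = log(N / df(t)).
n_docs = len(docs)
df = {t: sum(1 for c in counts if t in c) for t in terms}
idf = {t: math.log(n_docs / df[t]) for t in terms}

for t, row in zip(terms, tf):
    print(f"{t:8s} tf={row} idf={idf[t]:.3f}")
```

In this corpus "text" appears in three of four documents and gets the lowest IDF, while "cluster" appears in one and gets the highest.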

Text Mining With Statistica


Statistica Document Retrieval

Statistica Web Crawling

• Non-Focused

• Can filter file types

• Can specify domain to constrain

• Can specify depth of crawl

• Can specify max number of items in crawling tree
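How these four constraints interact can be illustrated with a breadth-first crawl over an in-memory link graph. This is not Statistica's implementation: the graph `LINKS`, the function `crawl`, and its parameter names are all hypothetical, and no network access is made.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for the web; no network access is made.
LINKS = {
    "http://a.example/index.html": [
        "http://a.example/p1.html",
        "http://b.example/x.html",
        "http://a.example/logo.png",
    ],
    "http://a.example/p1.html": ["http://a.example/p2.html"],
    "http://a.example/p2.html": ["http://a.example/p3.html"],
}


def crawl(seed, domain, max_depth, max_items, extensions=(".html",)):
    """Breadth-first crawl limited by domain, file type, depth and item count."""
    seen, order = {seed}, []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_items:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            parsed = urlparse(link)
            if (link not in seen and parsed.netloc == domain
                    and parsed.path.endswith(extensions)):
                seen.add(link)
                queue.append((link, depth + 1))
    return order


pages = crawl("http://a.example/index.html", "a.example", max_depth=2, max_items=10)
print(pages)
```

The off-domain page and the image are skipped by the domain and file-type filters, and the page at depth 3 is never reached because of the depth limit.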


Statistica Text Retrieval

Load Documents to Retrieve Text From

Statistica Text Retrieval Cont.

• Allows the number of words to be retrieved to be set (Advanced Tab)

• Allows for multiple filters (Filters Tab)
− Minimum Word Size
− Max Word Size
− Minimum % of files in which word occurs

• Allows for custom lists of stop words and inclusion words (Index Tab)
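The effect of these settings can be sketched outside Statistica. The function `filter_words`, its parameter names, and the default values below are assumptions made for illustration, not Statistica's own options or defaults.

```python
from collections import Counter


def filter_words(docs, min_len=3, max_len=25, min_pct=50.0, stop_words=(), max_words=None):
    """Keep words that pass length, document-percentage and stop-list filters."""
    doc_freq = Counter(w for d in docs for w in set(d))  # documents containing each word
    totals = Counter(w for d in docs for w in d)         # total occurrences per word
    kept = [
        w for w in totals
        if min_len <= len(w) <= max_len
        and 100.0 * doc_freq[w] / len(docs) >= min_pct
        and w not in stop_words
    ]
    kept.sort(key=lambda w: (-totals[w], w))  # most frequent first, ties alphabetical
    return kept[:max_words]


# Hypothetical tokenized documents.
docs = [
    ["the", "patent", "claims", "an", "engine"],
    ["the", "engine", "patent"],
    ["an", "engine", "design"],
    ["the", "patent", "office"],
]
selected = filter_words(docs, min_len=3, min_pct=50.0, stop_words={"the"}, max_words=2)
print(selected)
```

Here "an" fails the minimum word size, "claims", "design" and "office" fail the minimum percentage of files, and "the" is removed by the stop list.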


Statistica Text Mining Results

Summary of Word Occurrences


Singular Value Decomposition

Ordered Word Importance after performing SVD in Statistica

Statistica Document Clustering


Clustering Cont.

Select either K-Means or EM Clustering and set the number of clusters desired.

Clustering Cont.

Select variables based upon SVD. The variables are continuous in type.


Clustering Results

Training Error is used to specify the number of clusters.

Choose Number of Clusters

Inflection point indicates the point at which little gain occurs when increasing cluster count.
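The inflection-point idea can be reproduced with a basic k-means written from scratch. This sketch assumes that "training error" means the within-cluster sum of squared errors, which may differ from the quantity Statistica reports; the data points, the helper names, and the deterministic farthest-point initialisation are all choices made for this example.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(group) for axis in zip(*group))


def kmeans_sse(points, k, iterations=20):
    """Run a basic k-means and return the within-cluster sum of squared errors."""
    # Deterministic farthest-point initialisation (an assumption, for repeatability).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sum(min(dist2(p, c) for c in centers) for p in points)


# Hypothetical data: three tight, well-separated groups of points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
sse = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, err in sse.items():
    print(k, round(err, 3))
```

On this data the error falls sharply up to three clusters and only slightly from three to four, so the inflection point sits at three.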


Cluster Results Cont.

Save this Worksheet for ANN Classification

Document Classification

• Utilize Artificial Neural Network
− Output Variable is the Final Cluster Membership
− Input Variables are those selected to create the clusters
− Train on 80% of the data
− Test on 20% of the data
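The 80/20 split and the later count of misclassifications can be sketched without a neural network library. The helpers `train_test_split` and `misclassified`, the fixed seed, and the generated rows are hypothetical; the network itself is left out.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def misclassified(actual, predicted):
    """Count the cases where the predicted cluster differs from the actual one."""
    return sum(1 for a, p in zip(actual, predicted) if a != p)


# Hypothetical rows: (SVD-derived input variables, final cluster membership).
rows = [((i * 0.1, i * 0.2), i % 3) for i in range(50)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before the split keeps the test set from being just the last rows of the worksheet, which matters if the rows are ordered by cluster.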


Statistica ANN

Statistica ANN Cont.

Set training and testing sample sizes.

One can also include a validation set if desired.


Statistica ANN Results

Click on Predictions to see how we did.

Statistica Results Cont.

Points in red represent misclassifications


Text Mining With Statistica

• Demo