Zhangxi Lin
ISQS 3358
Texas Tech University
– Define data mining and list its objectives and benefits
– Understand different purposes and applications of data mining
– Understand different methods of data mining, especially clustering and decision tree models
– Build expertise in use of some data mining software
– Learn the process of data mining projects
– Understand data mining pitfalls and myths
– Define text mining and its objectives and benefits
– Appreciate use of text mining in business applications
– Define Web mining and its objectives and benefits
ISQS 6347, Data & Text Mining 4
Case 1: Credit Card Promotion
Credit card companies periodically send promotion offers (e.g., a life insurance promotion) to some potential customers. Assume:
– Each promotion letter costs $0.20
– The profit from each promotion acceptance is $10
– The overall response rate is 1%
Sending the offer to an unselected population results in an expected average profit per letter of $10 * 1% - $0.20 * 99% = -$0.098, which is a loss.
Questions:
– How can the promotion offers be sent to the right customers in order to make a profit?
– How can the profit be maximized by applying a proper set of selection rules?
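The arithmetic in Case 1 can be checked with a short sketch. It follows the slide's accounting (an acceptance earns $10, a non-response loses the $0.20 letter cost). The function names and the break-even calculation are illustrative additions, not part of the original slides.

```python
def expected_profit_per_letter(response_rate, profit_per_acceptance=10.0, letter_cost=0.20):
    """Expected profit per letter under the slide's accounting:
    acceptances earn the profit, non-responses lose the letter cost."""
    return profit_per_acceptance * response_rate - letter_cost * (1.0 - response_rate)


def break_even_response_rate(profit_per_acceptance=10.0, letter_cost=0.20):
    """Response rate at which the expected profit per letter is exactly zero."""
    return letter_cost / (profit_per_acceptance + letter_cost)


baseline = expected_profit_per_letter(0.01)   # unselected population, 1% response
threshold = break_even_response_rate()

print(f"Expected profit at 1% response: {baseline:.3f}")
print(f"Break-even response rate: {threshold:.4f}")
```

Under these assumptions, a selection rule pays off only if the customers it picks respond at a rate above the break-even threshold (a little under 2%).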
Case 2: Customer Segmentation
ID    Name  Gender  Age  Occupation
C001  X     M       15   Student
C002  Y     F       30   Staff
C003  Z     M       18   Student
C004  A     F       45   Staff
C005  B     M       30   Staff
C006  C     F       25   Student
The data is used to segment the customers for sales promotion.
Three products: DVD, game, a drink for adults
Problems:
– How to segment the customers into two clusters?
– Are two clusters good enough? Why not three clusters?
Case 3: Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
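One common way to judge rules like these is by their support and confidence. These two measures are not defined on the slide and are brought in here as the standard ones, so treat the sketch below as an illustration on the five market-basket transactions rather than the course's own method.

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(set(antecedent) | set(consequent)) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


rules = [
    ({"Diaper"}, {"Beer"}),
    ({"Milk", "Bread"}, {"Eggs", "Coke"}),
    ({"Beer", "Bread"}, {"Milk"}),
]
for lhs, rhs in rules:
    print(sorted(lhs), "->", sorted(rhs),
          "support:", support(lhs, rhs),
          "confidence:", confidence(lhs, rhs))
```

A high confidence only says the items tend to appear together in these baskets; as the slide notes, it says nothing about causality.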
Data mining (DM): A process that uses statistical, mathematical, artificial intelligence and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases
Knowledge discovery in databases (KDD): A comprehensive process of using data mining methods to find useful information and patterns in data
Major characteristics and objectives of data mining
– Data are often buried deep within very large databases, which sometimes contain data from several years; sometimes the data are cleansed and consolidated in a data warehouse
– The data mining environment is usually a client/server architecture or a Web-based architecture
– Sophisticated new tools help to remove the information ore buried in corporate files or archival public records; finding it involves massaging and synchronizing the data to get the right results
– The miner is often an end user, empowered by data drills and other power query tools to ask ad hoc questions and obtain answers quickly, with little or no programming skill
– Striking it rich often involves finding an unexpected result and requires end users to think creatively
– Data mining tools are readily combined with spreadsheets and other software development tools; the mined data can be analyzed and processed quickly and easily
– Parallel processing is sometimes used because of the large amounts of data and massive search efforts
How data mining works
– Data mining tools find patterns in data and may even infer rules from them
– Three methods are used to identify patterns in data:
  1. Simple models
  2. Intermediate models
  3. Complex models
Classification: Supervised induction used to analyze the historical data stored in a database and to automatically generate a model that can predict future behavior
Common tools used for classification are: neural networks, decision trees, and if-then-else rules
Clustering: Partitioning a database into segments in which the members of a segment share similar qualities
Association: A category of data mining algorithm that establishes relationships about items that occur together in a given record
Sequence discovery: The identification of associations over time
Visualization can be used in conjunction with data mining to gain a clearer understanding of many underlying relationships
Regression is a well-known statistical technique that is used to map data to a prediction value
Forecasting estimates future values based on patterns within large sets of data
Data mining applications
– Marketing
– Banking
– Retailing and sales
– Manufacturing and production
– Brokerage and securities trading
– Insurance
– Computer hardware and software
– Government and defense
– Airlines
– Health care
– Broadcasting
– Police
– Homeland security
Data mining tools and techniques can be classified based on the structure of the data and the algorithms used:
– Statistical methods
– Decision trees: defined as a root followed by internal nodes. Each node (including the root) is labeled with a question, and the arcs associated with each node cover all possible responses
– Case-based reasoning
– Neural computing
– Intelligent agents
– Genetic algorithms
– Other tools: rule induction, data visualization
A general algorithm for building a decision tree:
1. Create a root node and select a splitting attribute.
2. Add a branch to the root node for each split candidate value and label it.
3. Take the following iterative steps:
   a. Classify data by applying the split value.
   b. If a stopping point is reached, then create a leaf node and label it. Otherwise, build another subtree.
Gini index: Used in economics to measure the diversity of the population. The same concept can be used to determine the ‘purity’ of a specific class as a result of a decision to branch along a particular attribute/variable
Gini index for a given node t:
\[ \mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2 \]
(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information
Examples (class counts at a node):
C1  C2  Gini
0   6   0.000
1   5   0.278
2   4   0.444
3   3   0.500
Computing the Gini index with \( \mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2 \):
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
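The Gini values worked out above can be reproduced with a few lines of code. This is a minimal sketch; the function name `gini` and the list-of-counts input format are choices made for illustration.

```python
def gini(class_counts):
    """Gini index of a node: 1 minus the sum of squared class frequencies."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node has no records")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Class counts (C1, C2) for the example nodes on the slides.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
```

With two classes the index peaks at 0.5 for an even split, which matches the maximum of 1 - 1/n_c stated earlier.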
The ID3 (Iterative Dichotomizer 3) algorithm decision tree approach
Entropy: Measures the extent of uncertainty or randomness in a data set. If all the data in a subset belong to just one class, then there is no uncertainty or randomness in that dataset, and therefore the entropy is zero
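The slide gives no formula for entropy. The sketch below assumes the usual Shannon entropy with base-2 logarithms, which is the form ID3 is normally described with, and checks the property stated above: a subset in which every record belongs to one class has zero entropy.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of the class distribution at a node.
    Classes with zero records contribute nothing to the sum."""
    total = sum(class_counts)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


print(entropy([0, 6]))  # all records in one class: no uncertainty
print(entropy([3, 3]))  # evenly split between two classes: maximum uncertainty
print(entropy([1, 5]))  # mostly one class: somewhere in between
```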
Collection of data objects and their attributes (variables)
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
Object is also known as record, point, case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Columns are attributes (variables); rows are objects.
[Figure: Training Set → Learning algorithm → Learn Model (induction) → Model → Apply Model (deduction) → Test Set]
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Training Data
Tid  Refund         Marital Status  Taxable Income  Cheat
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes: NO
Refund = No: test MarSt
    MarSt = Married: NO
    MarSt = Single, Divorced: test TaxInc
        TaxInc < 80K: NO
        TaxInc > 80K: YES
Another decision tree for the same training data (Tid 1 to 10; Refund and Marital Status categorical, Taxable Income continuous, Cheat as the class):
MarSt = Married: NO
MarSt = Single, Divorced: test Refund
    Refund = Yes: NO
    Refund = No: test TaxInc
        TaxInc < 80K: NO
        TaxInc > 80K: YES
There could be more than one tree that fits the same data!
[Figure: Training Set → Tree Induction algorithm → Learn Model (induction) → Model (Decision Tree) → Apply Model (deduction) → Test Set, using the same training and test sets as in the earlier framework figure]
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Decision tree:
Refund = Yes: NO
Refund = No: test MarSt
    MarSt = Married: NO
    MarSt = Single, Divorced: test TaxInc
        TaxInc < 80K: NO
        TaxInc > 80K: YES
1. Start from the root of the tree.
2. Refund = No, so follow the No branch to MarSt.
3. MarSt = Married, so follow the Married branch to the leaf NO.
4. Assign Cheat to “No”.
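The walk down the tree can also be written as nested conditions. This is a sketch of the Refund / MarSt / TaxInc tree from the slides. The function name and argument names are illustrative, and because the slide labels the income branches only as "< 80K" and "> 80K", sending an income of exactly 80K down the "> 80K" branch is an assumption.

```python
def classify_cheat(refund, marital_status, taxable_income_k):
    """Apply the decision tree from the slides and return the predicted
    Cheat label ("Yes" or "No"). Income is given in thousands."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Remaining cases: Refund = No and marital status Single or Divorced.
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if taxable_income_k < 80 else "Yes"


# The test record from the slides: Refund = No, Married, 80K.
print(classify_cheat("No", "Married", 80))

# The ten training records, as (Refund, Marital Status, Income in K, Cheat).
training = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
correct = sum(classify_cheat(r, m, inc) == label for r, m, inc, label in training)
print(correct, "of", len(training), "training records classified correctly")
```

The test record never reaches the income test, so the boundary assumption does not affect its predicted label.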
                 Actual Accept           Actual Reject           Total
Computed Accept  True Positive (TP): a   False Positive (FP): c  a + c
Computed Reject  False Negative (FN): b  True Negative (TN): d   b + d
Total            a + b                   c + d
Accuracy rate = a / (a + c)
Coverage rate = a / (a + b)
Lift = Accuracy rate / [(a + b) / (a + b + c + d)]
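The three measures can be computed directly from the four cells a, b, c and d. The sketch below uses the slide's definitions; the cell counts in the example are made up purely for illustration and do not come from the course material.

```python
def evaluation_measures(a, b, c, d):
    """Accuracy rate, coverage rate and lift, as defined on the slide.
    a = true positives, b = false negatives,
    c = false positives, d = true negatives."""
    accuracy_rate = a / (a + c)               # share of computed accepts that are actual accepts
    coverage_rate = a / (a + b)               # share of actual accepts that were found
    base_rate = (a + b) / (a + b + c + d)     # overall share of actual accepts
    lift = accuracy_rate / base_rate
    return accuracy_rate, coverage_rate, lift


# Hypothetical counts: 1,000 cases, 100 actual accepts, 200 computed accepts.
accuracy_rate, coverage_rate, lift = evaluation_measures(a=50, b=50, c=150, d=750)
print(f"accuracy rate = {accuracy_rate:.2f}")
print(f"coverage rate = {coverage_rate:.2f}")
print(f"lift = {lift:.2f}")
```

A lift above 1 means the model's selected group contains actual accepts at a higher rate than the population as a whole.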
Cluster analysis for data mining
– Cluster analysis is an exploratory data analysis tool for solving classification problems
– The object is to sort cases into groups so that the degree of association is strong between members of the same cluster and weak between members of different clusters
Cluster analysis results may be used to:
– Help identify a classification scheme
– Suggest statistical models to describe populations
– Indicate rules for assigning new cases to classes for identification, targeting, and diagnostic purposes
– Provide measures of definition, size, and change in what were previously broad concepts
– Find typical cases to represent classes
Cluster analysis methods
– Statistical methods
– Optimal methods
– Neural networks
– Fuzzy logic
– Genetic algorithms
Each of these methods generally works with one of two general method classes:
– Divisive
– Agglomerative
Hierarchical clustering method and example
1. Decide which data to record from the items
2. Calculate the distances between all initial clusters. Store the results in a distance matrix
3. Search through the distance matrix and find the two most similar clusters
4. Fuse those two clusters together to produce a cluster that has at least two items
5. Calculate the distances between this new cluster and all the other clusters
6. Repeat steps 3 to 5 until you have reached the prespecified maximum number of clusters
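The steps above can be sketched in a few lines. This version makes several assumptions that the slides do not specify: it clusters on a single numeric attribute (the customer ages from Case 2), measures the distance between two clusters by their closest pair of members (single linkage), and recomputes distances on each pass instead of storing a distance matrix.

```python
def agglomerate(values, num_clusters):
    """Bottom-up hierarchical clustering of one-dimensional values."""
    clusters = [[v] for v in values]  # every item starts as its own cluster

    def distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in c1 for y in c2)

    while len(clusters) > num_clusters:
        # Find the two most similar (closest) clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Fuse them and keep every other cluster unchanged.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return sorted(sorted(c) for c in clusters)


ages = [15, 30, 18, 45, 30, 25]  # ages of customers C001 to C006 in Case 2
three = agglomerate(ages, 3)
two = agglomerate(ages, 2)
print(three)
print(two)
```

On these ages, stopping at three clusters separates the teenagers, the 25 to 30 group and the 45-year-old, while stopping at two leaves only the 45-year-old apart. That is one way to look at the Case 2 question of whether two clusters are enough.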
Classes of data mining tools and techniques as they relate to information and business intelligence (BI) technologies
– Mathematical and statistical analysis packages
– Personalization tools for Web-based marketing
– Analytics built into marketing platforms
– Advanced CRM tools
– Analytics added to other vertical industry-specific platforms
– Analytics added to database tools (e.g., OLAP)
– Standalone data mining tools
What Is Text Mining?
Text mining is a process that employs a set of algorithms for converting unstructured text into structured data objects and the quantitative methods used to analyze these data objects.
Text Mining Case: Federalist papers
Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to try to convince the citizens of the state of New York to ratify the new constitution of the United States. These essays are collectively called The Federalist Papers. Copies of the papers in a variety of formats can be found at http://www.yale.edu/lawweb/avalon/federal/fed.htm, or http://www.constitution.org/fed/federa00.htm.
Of the 85 essays, 51 are attributed to Hamilton, 15 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 11 remaining essays can be attributed only to Hamilton or Madison. Mosteller and Wallace (1964) used Bayesian statistical techniques to provide evidence that Madison wrote all 11 of the essays of unknown authorship. (The essays in question are numbers 49, 50, 51, 52, 53, 54, 55, 56, 57, 62, and 63.)
Problem: Uniquely identify an author based on the distribution of words in a document.
A simple text mining example
A tiny case: 9 documents, each labeled with its topic (Fin, Riv, or Par)
1. deposit the cash and check in the bank - Fin
2. the river boat is on the bank - Riv
3. borrow based on credit - Fin
4. river boat floats up the river - Riv
5. boat is by the dock near the bank - Riv
6. with credit, I can borrow cash from the bank - Fin
7. boat floats by dock near the river bank - Riv
8. check the parade route to see the floats - Par
9. along the parade route - Par
Text mining helps organizations:
– Find the “hidden” content of documents, including additional useful relationships
– Relate documents across previously unnoticed divisions
– Group documents by common themes
Applications of text mining
– Automatic detection of e-mail spam or phishing through analysis of the document content
– Automatic processing of messages or e-mails to route a message to the most appropriate party to process that message
– Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems and relevant responses
– Analysis of related scientific publications in journals to create an automated summary view of a particular discipline
– Creation of a “relationship view” of a document collection
– Qualitative analysis of documents to detect deception
How to mine text
1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
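A rough sketch of steps 1, 2 and 4 on the nine-document example is shown below. Everything specific in it is an assumption made for illustration: the stop-word list is hand-picked, the "stemmer" only strips a trailing "s", step 3 (synonyms and phrases) is skipped, and TF-IDF is used as the weighting scheme because the slide does not name one.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, chosen by hand for this tiny corpus.
STOP_WORDS = {"the", "and", "in", "is", "on", "up", "by", "with",
              "i", "can", "from", "to", "see", "near", "along"}

docs = [
    "deposit the cash and check in the bank",
    "the river boat is on the bank",
    "borrow based on credit",
    "river boat floats up the river",
    "boat is by the dock near the bank",
    "with credit, I can borrow cash from the bank",
    "boat floats by dock near the river bank",
    "check the parade route to see the floats",
    "along the parade route",
]


def stem(word):
    """Crude stand-in for a stemming algorithm: drop a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def tokenize(text):
    """Lowercase, split into words, drop stop-words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


tokens = [tokenize(d) for d in docs]
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in tokens for term in set(doc))


def tf_idf(doc_tokens):
    """Weight = term count in the document * log(N / document frequency)."""
    counts = Counter(doc_tokens)
    return {term: n * math.log(len(docs) / doc_freq[term])
            for term, n in counts.items()}


weights = tf_idf(tokens[3])  # "river boat floats up the river"
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Terms that appear in many documents, such as "bank", get low weights, while terms concentrated in a few documents stand out, which is what makes the weighted terms useful for grouping documents by theme.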
Example
Coca-Cola announced earnings on Saturday, Dec. 12, 2000. Profits were up by 3.1% as of 12/12/1999.
coca-cola + announce earnings on Saturday dec. 12 2000 + profit + be up 3.1% as of 2000-12-12 1999-12-12
Web mining: The discovery and analysis of interesting and useful information from the Web, about the Web, and usually through Web-based tools
Web content mining: The extraction of useful information from Web pages
Web structure mining: The development of useful information from the links included in the Web documents
Web usage mining: The extraction of useful information from the data being generated through webpage visits, transactions, etc.
Uses for Web mining:
– Determine the lifetime value of clients
– Design cross-marketing strategies across products
– Evaluate promotional campaigns
– Target electronic ads and coupons at user groups
– Predict user behavior
– Present dynamic information to users
[Figure: banner ad click-through flow. Stages: Banners, Landing page, Sign up, Target page, Buy, linked by clicks, with an Exit possible at each stage. Labels: BANNERAD, ABANDON, PROPBUY. Depth of conversion: Sign up, First time purchase, Repeated purchase. Data is collected at each stage.]
How to improve the effectiveness of banner advertising?
– Understand the context:
  – Availability of the information: click-through flow, user profile, etc.
  – Multiple ads: which one should be used?
– Data collection
– Data mining
– Model evaluation
Model can be built using:
– Web log data
– Registration data
– Vendor data (may not be required)
One model with indicator for banner ad/vendor selected
Multiple models, one for each vendor
– Overlapping data if page sequences are included, because “did not click” entries will have common elements in all models
Model scores the propensity to click on a vendor’s banner ad
If there is only one slot for one of two ads, which is the best decision?
– Selectively place an ad from the two choices
– Randomly place one of the ads
– Place both, either with two slots or by alternating them in the same slot (time-sharing)
– Place nothing when the likelihood of a click-through is low, because of the possible negative effect
SAS Enterprise Miner 4.3
Basics:
– How to use the application main menu
– Using the pop-up menus
– Enterprise Miner documentation
– Project – Diagram
The SEMMA methodology: Sample, Explore, Modify, Model, Assess
Decision Tree Example (pp147-151)
Income Pattern# Loan risk
17 1 High
20 5 High
23 0 High
32 4 Low
43 2 High
68 3 Low
MBA Admission Decision Problem
GMAT  GPA  Quantitative GMAT Score (percentile)  Decision
650 2.75 35 No
580 3.50 70 No
600 3.50 75 Yes
450 2.95 80 No
700 3.25 90 Yes
590 3.50 80 Yes
400 3.85 45 No
640 3.50 75 Yes
540 3.00 60 ?
590 2.85 80 ?
490 4.00 65 ?