cluster2
Clustering in Data Warehouse
Department of CE, MSPVL Polytechnic College, Pavoorchatram
Overview
Part 1: What is Data Mining
Part 2: Association Rules
Part 3: Classification
Part 4: Clustering
Part 5: Approaches to Data Mining Problems
Part 6: Application of Data Mining
Part 7: Commercial Tools for Data Mining
Part 1: Data Mining
Relationship to data warehouse
Data mining uses the data warehouse to support decisions; the data warehouse itself exists to support decision making.
Data mining can also be applied to operational databases with individual transactions.
Data mining helps in extracting meaningful new patterns.
Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on the construction of the data warehouse.
Define Data Mining
Data mining is sorting through data to identify patterns and establish relationships.
Data mining parameters include:
Association - looking for patterns where one event is connected to another event
Sequence or path analysis - looking for patterns where one event leads to another later event
Classification - looking for new patterns (may result in a change in the way the data is organized, but that's OK)
Clustering - finding and visually documenting groups of facts not previously known
Part 2: Association Rules
Association Rules
Association rules describe relationships between sets of items in a large database.
Why Association Rules?
Purchase table:
Bread, milk
Milk, sugar
Pen, ink
The general form of an association rule is X → Y, where
X is a set of items {x1, x2, …, xn}
Y is a set of items {y1, y2, …, yn}
The rule states that database tuples that satisfy the condition in X are also likely to satisfy the condition in Y.
Consider the Purchase Table
Retail shops are often interested in associations between the different items that people buy. Referring to the purchase table above, it is clear that:
People who buy a pen also buy ink.
People who buy bread also buy milk.
Association rules measures
Support
Confidence
Support
Support is the percentage of transactions that contain all the items in the union of the LHS and the RHS.
Suppose the rule Pen → Ink has a support of 75%. That is, the items in LHS ∪ RHS occur together in 75% of transactions. A higher support means the rule applies to a larger share of transactions.
Confidence
Confidence is the percentage of transactions containing the items in the LHS that also contain the items in the RHS.
Confidence is a measure of how often the rule is true.
If Bread → Milk has a confidence of 80%, then 80% of the purchases that include bread also include milk.
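Both measures can be computed directly from a list of transactions. The sketch below is only an illustration: the transactions are hypothetical and are not the purchase table from the slides, and the function names are made up for this example.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"pen", "ink"},
    {"pen", "ink", "bread"},
    {"bread", "milk"},
    {"pen", "ink", "milk"},
]

def support(lhs, rhs, transactions):
    """Fraction of transactions that contain every item of LHS ∪ RHS."""
    items = set(lhs) | set(rhs)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing the LHS, the fraction that also contain the RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0  # the rule never applies, so report zero confidence
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)

print(support({"pen"}, {"ink"}, transactions))       # 0.75
print(confidence({"pen"}, {"ink"}, transactions))    # 1.0
print(confidence({"bread"}, {"milk"}, transactions)) # 0.5
```

With this data, Pen → Ink has 75% support because pen and ink appear together in three of the four transactions.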
Part 3: Classification
Classification rules
Decision trees
Mathematical formula
Neural network
Some basic operations
Predictive: Regression, Classification
Descriptive: Clustering / similarity matching, Association rules and variants, Deviation detection
Classification
Given old data about customers and payments, predict new applicant’s loan eligibility.
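[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are applied to a new applicant's data to predict Good/bad.]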
Classification
Classification is a data mining (machine learning) technique used to predict group membership for data instances.
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions.
Fraud detection:
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Classification
Classification is defined as a process of finding a set of functions that describe and distinguish data classes.
Training Data
Name   Age      Income   Rating
abc    20       Low      Fair
xyz    31…40    Medium   Good
mny    40…50    High     Excellent
Classification algorithm
Classification Rules
If age = "31…40" and income = high, then rating = good.
Classification
Using this function, we can find the classes of objects whose class labels are not known, based on a set of training data.
Training data is data whose class labels are known.
The following are the different forms of classification:
Classification rules
Decision trees
Mathematical formula
Neural network
Classification methods
Goal: Predict class Ci = f(x1, x2, …, xn)
Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci.
Decision tree classifier: divide decision space into piecewise constant regions.
Neural networks: partition by non-linear boundaries
Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example decision tree with internal nodes "Salary < 1 M", "Prof = teacher" and "Age < 30", and leaves labelled Good or Bad.]
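A decision tree of this kind can be sketched as nested nodes, where each internal node holds one test and each leaf holds a class label. The tree shape below is hypothetical: the slide's figure does not survive in the transcript, so the arrangement of the three tests and the field names (`salary`, `age`, `profession`) are assumptions made only for illustration.

```python
# A node is either a class label (a string) or a tuple (test, if_true, if_false).
# The arrangement of tests and labels here is hypothetical.
tree = (
    lambda r: r["salary"] < 1_000_000,
    (lambda r: r["age"] < 30, "Bad", "Good"),
    (lambda r: r["profession"] == "teacher", "Good", "Bad"),
)

def predict(node, record):
    """Walk from the root to a leaf, applying one decision rule per internal node."""
    while not isinstance(node, str):
        test, if_true, if_false = node
        node = if_true if test(record) else if_false
    return node

applicant = {"salary": 500_000, "age": 40, "profession": "clerk"}
print(predict(tree, applicant))  # Good
```

Classifying a record costs one test per level of the tree, which matches the "fast application" advantage usually claimed for decision trees.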
Pros and Cons of decision trees
• Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data
• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features
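Neural network
Set of nodes connected by directed weighted edges.
[Figure: a basic NN unit with inputs x1, x2, x3, weights w1, w2, w3 and output o; a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes.]
Basic NN unit: $o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right)$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$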
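A basic unit can be sketched in a few lines, assuming it computes a weighted sum of its inputs and passes the result through a sigmoid activation. The sigmoid is an assumption about the slide's figure, and the weights and inputs below are hypothetical.

```python
import math

def sigmoid(y):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

# Hypothetical weights w1..w3 and inputs x1..x3.
print(unit_output([1.0, -1.0, 0.5], [1.0, 1.0, 0.0]))  # 0.5, because the weighted sum is 0
```

A more typical network feeds the outputs of several such units (the hidden nodes) into further units (the output nodes).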
Pros and Cons of Neural Network
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
Conclusion: Use neural nets only if decision trees or nearest-neighbour (NN) methods fail.
Part 4: Clustering
Partitioning clustering algorithm
Hierarchical clustering algorithm
Clustering
Unsupervised learning is used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, such that similar customers are in the same cluster.
Key requirement: a good measure of similarity between instances.
Similarity
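One common way to measure similarity is through a distance: the smaller the distance between two instances, the more similar they are. The sketch below uses Euclidean distance, which is only one possible choice and is not named in the slides. The payment histories are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical monthly payments for three customers.
customer_a = [100, 100, 100]
customer_b = [100, 90, 100]
customer_c = [0, 0, 300]

print(euclidean_distance(customer_a, customer_b))  # 10.0
print(euclidean_distance(customer_a, customer_c))  # much larger, so a and c are less similar
```

Under this measure, customers a and b would tend to land in the same cluster, while c would not.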
Prevalent ≠ Interesting
Analysts already know about prevalent rules.
Interesting rules are those that deviate from prior expectation.
Mining's payoff is in finding surprising phenomena.
[Figure: in 1995, "Milk and cereal sell together!" is news; in 1998, the same finding gets "Zzzz... Milk and cereal sell together!"]
Clustering Algorithm
Partition clustering Algorithm
Hierarchical clustering algorithm
Partition clustering Algorithm
A partition clustering algorithm divides the records into k clusters, with each record belonging to exactly one cluster.
The number of clusters k is given by the user.
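The slides do not name a specific partition algorithm. As one well-known example, k-means can be sketched as follows. Taking the first k points as the initial cluster centres is a simplifying assumption, and the data points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Partition points into k clusters; returns (assignment, centroids)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical two-dimensional records.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 1, 1, 0, 1]
```

The user supplies k, and the result depends on the initial centroids, so practical implementations usually try several starting points.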
Hierarchical clustering algorithm
A hierarchical clustering algorithm generates a tree of clusters.
That is, in the first step each cluster consists of a single record.
In the second step, the two closest clusters are grouped together.
This grouping is repeated until, in the final step, there is a single partition containing all records.
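The bottom-up procedure can be sketched as below. The distance between two clusters is taken to be the distance between their closest members (single linkage), which is an assumption, since the slides do not say how clusters are compared. The values are hypothetical one-dimensional records.

```python
def agglomerate(values):
    """Merge the two closest clusters repeatedly; return the list of merges in order."""
    clusters = [[v] for v in values]  # first step: one record per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

for left, right in agglomerate([1, 2, 6, 7, 15]):
    print(left, "+", right)
```

The sequence of merges is the tree of clusters: reading it backwards from the last merge gives the splits from the root down to single records.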
Part 5: Approaches to data mining problems
Discovery of sequential patterns
Discovery of patterns in time series
Discovery of classification rules
Regression
Discovery of sequential patterns
Trans_id   Time   Item_Purchased
101        6.35   Milk, bread, juice
792        7.38   Milk, juice
1130       8.05   Milk, eggs
1735       8.40   Bread, cookies, coffee
Suppose a customer visits the shop three times and purchases the following sequence of item sets:
{ milk, bread, juice }
{ bread, eggs }
{ cookies, milk, coffee }
The problem of discovering sequential patterns is to find all subsequences, from the given set of sequences, that have a user-defined minimum support.
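Checking the support of one candidate pattern can be sketched as follows. The first customer sequence is the one given above; the other two sequences and the function names are hypothetical. A full miner would also have to generate the candidate patterns, which this sketch does not attempt.

```python
def contains(sequence, pattern):
    """True if the pattern's item sets occur, in order, within item sets of the sequence."""
    pos = 0
    for wanted in pattern:
        # Skip visits that do not include all the wanted items.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later visit
    return True

def sequence_support(pattern, sequences):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

sequences = [
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    [{"milk"}, {"bread"}, {"juice"}],   # hypothetical customer
    [{"bread", "eggs"}, {"milk"}],      # hypothetical customer
]

pattern = [{"bread"}, {"milk"}]  # bread on one visit, milk on a later visit
print(sequence_support(pattern, sequences))  # two of the three sequences contain it
```

A pattern is reported only if its support reaches the user-defined minimum.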
Discovery of patterns in time series
Time series are sequences of events, where each event is a fixed type of transaction.
For example, in the closing prices of a stock one may look for:
The period during which the stock rose steadily for n days.
The longest period over which the stock changed by not more than 1% from the last closing price.
The quarter of a year during which the stock had the largest percentage gain or loss.
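The first kind of query, a stretch of steadily rising prices, can be sketched as a single pass over the series. The closing prices below are hypothetical, and "steady rise" is interpreted here as each price being strictly higher than the previous one, which is an assumption.

```python
def longest_rising_run(prices):
    """Return (start, end) indices of the longest stretch of strictly rising prices."""
    if not prices:
        return None
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(prices)):
        if prices[i] <= prices[i - 1]:
            start = i  # the rise is broken, so a new run starts here
        if i - start > best_end - best_start:
            best_start, best_end = start, i
    return best_start, best_end

# Hypothetical daily closing prices.
closing = [10, 11, 12, 11, 12, 13, 14, 13]
print(longest_rising_run(closing))  # (3, 6)
```

Here the stock rises on three consecutive days, from index 3 to index 6.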
Discovery of classification rules
Classification is the process of defining a function that classifies a given object into one of many possible classes.
Example
A bank wishes to classify its loan applicants into two groups or classes:
A group who are loan worthy (eligible)
Another group who are not worthy (not eligible)
To do the above classification, the bank can use the classification rule given below:
If monthly income is greater than 30,000 then the applicant is worthy, else not worthy.
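Written as a function, the rule is a single comparison. The function and constant names are hypothetical; the threshold is the one stated in the rule.

```python
INCOME_THRESHOLD = 30_000  # monthly income limit taken from the rule above

def classify_applicant(monthly_income):
    """Apply the bank's rule: worthy only if income is strictly above the threshold."""
    return "worthy" if monthly_income > INCOME_THRESHOLD else "not worthy"

print(classify_applicant(45_000))  # worthy
print(classify_applicant(30_000))  # not worthy
```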
Regression
Regression is defined as a function over variables which gives a target class variable.
Example
Labtest(Patient id, test1, test2, …, testn)
This contains the values of n tests for one patient.
The target variable that we wish to predict is p, the probability of survival of the patient.
p = f(test1, test2, test3, …, testn)
This function is called the regression function.
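As a minimal sketch of how such a function might be estimated, the code below fits a straight line y = a*x + c to one input variable by ordinary least squares. Using a single variable and a linear form are simplifying assumptions, and the data are hypothetical. A linear fit is not restricted to the range 0 to 1, so a real survival-probability model would need a different form.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, c) for the model y = a*x + c."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    c = mean_y - a * mean_x
    return a, c

# Hypothetical observations that lie exactly on y = 2x + 1.
a, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, c)  # 2.0 1.0
```

Once a and c are known, the fitted function predicts the target for a new input as a*x + c.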