cluster2

Clustering in Data Warehouse: Department of CE, MSPVL Polytechnic College, Pavoorchatram


Post on 24-Jan-2015


Page 1: Cluster2

Clustering in Data Warehouse

Department of CE, MSPVL Polytechnic College, Pavoorchatram

Page 2: Cluster2

Overview

Part 1: What is data mining
Part 2: Association rules
Part 3: Classification
Part 4: Clustering
Part 5: Approaches to data mining problems
Part 6: Application of data mining
Part 7: Commercial tools for data mining

Page 3: Cluster2


Part 1: Data Mining

Page 4: Cluster2

Relationship to data warehouse

Data mining draws on the data warehouse to support decisions; the data warehouse itself exists to support decision making.

Data mining can also be applied to operational databases that hold individual transactions.

Data mining helps in extracting meaningful new patterns.

Data mining applications should be considered during the design of a data warehouse. The successful use of data mining applications depends on how the data warehouse is constructed.

Page 5: Cluster2

Define Data Mining

Data mining is sorting through data to identify patterns and establish relationships.

Data mining parameters include:
• Association - looking for patterns where one event is connected to another event
• Sequence or path analysis - looking for patterns where one event leads to another later event
• Classification - looking for new patterns (may result in a change in the way the data is organized, but that's OK)
• Clustering - finding and visually documenting groups of facts not previously known

Page 6: Cluster2


Part 2: Association Rules

Page 7: Cluster2

Association Rules

Association rules describe relationships between sets of items in a large database.

Page 8: Cluster2

Why Association Rules?

Bread, milk
Milk, sugar
Pen, ink

Page 9: Cluster2

The general form of an association rule is

X → Y

X: set of items {x1, x2, …, xn}
Y: set of items {y1, y2, y3, …, yn}

The above rule can be stated as: database tuples that satisfy the condition in X are also likely to satisfy the condition in Y.

Page 10: Cluster2

Consider the Purchase Table

Retail shops are often interested in associations between the different items that people buy. Referring to the purchase table, it is clear that:
• People who buy pens also buy ink
• People who buy bread also buy milk

Page 11: Cluster2

Association rules measures

• Support
• Confidence

Page 12: Cluster2

Support

Support is the percentage of transactions that contain all the items in both the LHS and the RHS of the rule, that is, the union LHS ∪ RHS.

For example, if the rule Pen → Ink has a support of 75%, the items in LHS ∪ RHS occur together in 75% of transactions. A higher support means the rule covers a larger share of the transactions.
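A minimal sketch of the support computation. The purchase table itself is not reproduced in this transcript, so the transactions below are made up for illustration, and the helper name `support` is hypothetical.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"pen", "ink", "milk"},
    {"pen", "ink"},
    {"bread", "milk"},
    {"pen", "ink", "bread"},
]


def support(lhs, rhs, transactions):
    """Fraction of transactions that contain every item of LHS and RHS."""
    items = set(lhs) | set(rhs)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


# Pen and ink appear together in 3 of the 4 transactions.
print(support({"pen"}, {"ink"}, transactions))  # 0.75
```

With this toy data the rule Pen → Ink reaches the 75% support quoted on the slide.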

Page 13: Cluster2

Confidence

Confidence is the percentage of transactions containing the items in the LHS that also include the items in the RHS.

Confidence is a measure of how often the rule is true.

For example, a confidence of 80% for the rule Bread → Milk means that 80% of the purchases that include bread also include milk.
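Confidence can be sketched the same way: count the transactions that contain the LHS, then see what share of them also contain the RHS. The transactions are invented for illustration and `confidence` is a hypothetical helper name.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "sugar"},
    {"bread", "milk", "pen"},
    {"bread", "milk", "ink"},
    {"bread", "sugar"},
    {"pen", "ink"},
]


def confidence(lhs, rhs, transactions):
    """Share of the transactions containing LHS that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0  # the rule never applies, so report zero confidence
    both = sum(1 for t in with_lhs if rhs <= t)
    return both / len(with_lhs)


# Bread appears in 5 transactions, 4 of which also contain milk.
print(confidence({"bread"}, {"milk"}, transactions))  # 0.8
```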

Page 14: Cluster2

Part 3: Classification

• Classification rules
• Decision trees
• Mathematical formula
• Neural network

Page 15: Cluster2

Some basic operations

Predictive:
• Regression
• Classification

Descriptive:
• Clustering / similarity matching
• Association rules and variants
• Deviation detection

Page 16: Cluster2

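Classification

Given old data about customers and payments, predict a new applicant's loan eligibility.

[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are applied to a new applicant's data to label the applicant good or bad.]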

Page 17: Cluster2

Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances.


Page 18: Cluster2

Why Data Mining

• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
– Identify likely responders to sales promotions

• Fraud detection:
– Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

• Customer relationship management:
– Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Page 19: Cluster2

Classification

Classification is defined as a process of finding a set of functions that describe and distinguish data classes.

Training Data

Name   Age     Income   Rating
abc    20      Low      Fair
xyz    31…40   Medium   Good
mny    40…50   High     Excellent

Classification algorithm

Classification Rules

If age = "31…40" and income = high, then rating = good.

Page 20: Cluster2

Classification

With classification, we can find out the classes of objects whose class labels are not known, based on a set of training data.

Training data is data whose class labels are known.

The following are the different forms of classification:
• Classification rules
• Decision trees
• Mathematical formula
• Neural network

Page 21: Cluster2

Classification methods

Goal: Predict class Ci = f(x1, x2, …, xn)

• Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
• Decision tree classifier: divide decision space into piecewise constant regions
• Neural networks: partition by non-linear boundaries

Page 22: Cluster2

Decision trees

Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: example decision tree with internal nodes "Salary < 1 M", "Prof = teacher" and "Age < 30", and leaf nodes labelled Good or Bad.]
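A decision tree of this kind amounts to nested tests on the attributes. The sketch below uses the three tests named on the slide, but the branch layout of the original figure is not recoverable from the transcript, so the arrangement and the leaf labels are assumptions, as is the function name `classify`.

```python
def classify(salary, profession, age):
    """Walk a small hand-written decision tree and return a class label.

    Assumed layout (a guess, not the slide's actual figure):
      root tests salary; the low-salary branch tests profession,
      the high-salary branch tests age.
    """
    if salary < 1_000_000:          # internal node: Salary < 1 M
        if profession == "teacher":  # internal node: Prof = teacher
            return "Good"            # leaf
        return "Bad"                 # leaf
    if age < 30:                     # internal node: Age < 30
        return "Bad"                 # leaf
    return "Good"                    # leaf


print(classify(500_000, "teacher", 45))
print(classify(2_000_000, "engineer", 25))
```

Each call follows exactly one path from the root to a leaf, which is why applying a trained tree is fast and the result is easy to explain.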

Page 23: Cluster2

Pros and Cons of decision trees

• Cons
– Cannot handle complicated relationships between features
– Simple decision boundaries
– Problems with lots of missing data

• Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of features

Page 24: Cluster2

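Neural network

Set of nodes connected by directed weighted edges

[Figure: a basic NN unit with inputs x1, x2, x3, weights w1, w2, w3 and output o, next to a more typical NN in which inputs x1, x2, x3 feed hidden nodes and then output nodes.]

Basic NN unit: $o = \sigma\left(\sum_{i=1}^{n} w_i x_i\right)$, where $\sigma(y) = \dfrac{1}{1 + e^{-y}}$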
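A basic unit of this kind is commonly modelled as a weighted sum of its inputs passed through a sigmoid function. The sketch below follows that assumption; the function names and the sample weights are illustrative only.

```python
import math


def sigmoid(y):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def unit_output(weights, inputs):
    """Output of one unit: sigmoid of the weighted sum of its inputs."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(total)


# With all weights at zero the weighted sum is 0 and the sigmoid gives 0.5.
print(unit_output([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))  # 0.5
```

A larger network chains such units together, feeding the outputs of the hidden nodes into the output nodes.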

Page 25: Cluster2

Pros and Cons of Neural Network

• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes

• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features

Conclusion: Use neural nets only if decision trees fail.

Page 26: Cluster2

Part 4: Clustering

• Partitioning clustering algorithm
• Hierarchical clustering algorithm

Page 27: Cluster2

Clustering

Unsupervised learning: used when old data with class labels is not available, e.g. when introducing a new product.

Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.

Key requirement: Need a good measure of similarity between instances.
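One common similarity measure, used here only as an assumption since the slide does not name one, is Euclidean distance: the smaller the distance between two payment histories, the more similar the customers. The payment figures are made up.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric records."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of values")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical monthly payment histories for three customers.
customer_a = [100.0, 100.0, 100.0]
customer_b = [100.0, 103.0, 104.0]
customer_c = [10.0, 500.0, 0.0]

print(euclidean_distance(customer_a, customer_b))  # 5.0
print(euclidean_distance(customer_a, customer_c))
```

Customers A and B are much closer to each other than either is to C, so a clustering algorithm built on this measure would tend to group A and B together.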

Page 28: Cluster2

clustering


Page 29: Cluster2

Similarity


Page 30: Cluster2

Prevalent ≠ Interesting

Analysts already know about prevalent rules

Interesting rules are those that deviate from prior expectation

Mining's payoff is in finding surprising phenomena

[Figure: in 1995, "Milk and cereal sell together!"; in 1998, "Zzzz... Milk and cereal sell together!"]

Page 31: Cluster2

Clustering Algorithm

• Partition clustering algorithm
• Hierarchical clustering algorithm

Page 32: Cluster2

Partition clustering Algorithm

Partition clustering algorithm divides the records into k clusters, producing a single flat partition rather than a tree.

The number of clusters k is given by the user.
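k-means is one well-known partition clustering algorithm. The slide does not name a specific algorithm, so its use here is an assumption, as are the sample points and the choice of the first k points as starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (clusters, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return clusters, centroids


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0)]  # made-up records
clusters, centroids = k_means(points, k=2)
print(clusters)
print(centroids)
```

The user supplies k up front, and the result is one flat grouping of the records with no nesting between clusters.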

Page 33: Cluster2

Hierarchical clustering algorithm

Hierarchical clustering algorithm generates a tree of clusters.

In the first step, each cluster consists of a single record.

In the second step, two clusters are grouped together.

In the final step, all records are in a single cluster.
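The bottom-up procedure can be sketched on one-dimensional values. The merge criterion used here (single linkage, meaning the smallest gap between any two members) is an assumption, since the slide does not say how the two clusters to merge are chosen; the sample values are made up.

```python
def agglomerate(values):
    """Merge the two closest clusters repeatedly; return every level."""
    clusters = [[v] for v in values]           # step 1: one record per cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                gap = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history


for level in agglomerate([1, 2, 10]):
    print(level)
```

Reading the printed levels from first to last traces the tree of clusters from the individual records up to the single final cluster.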

Page 34: Cluster2

Part 5: Approaches to data mining problems

• Discovery of sequential patterns
• Discovery of patterns in time series
• Discovery of classification rules
• Regression

Page 35: Cluster2

Discovery of sequential patterns

Trans_id   Time   Item_Purchased
101        6.35   Milk, bread, juice
792        7.38   Milk, juice
1130       8.05   Milk, eggs
1735       8.40   Bread, cookies, coffee

Suppose a customer visits the shop three times and purchases the following sequence of item sets:

{milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee}

The problem of discovering sequential patterns is to find all subsequences from the given sets of sequences that have a user-defined minimum support.
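The support of one candidate subsequence can be sketched as follows. The first customer's sequence is the one given above; the other two customers, the candidate pattern, and the helper names are invented for illustration.

```python
def contains(sequence, pattern):
    """True if the pattern's item sets occur, in order, within the sequence."""
    pos = 0
    for itemset in sequence:
        # Match the next pattern item set against the earliest visit that holds it.
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


sequences = [
    # Sequence taken from the slide.
    [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"cookies", "milk", "coffee"}],
    # Hypothetical extra customers.
    [{"milk"}, {"coffee"}],
    [{"bread", "eggs"}],
]

pattern = [{"milk"}, {"coffee"}]  # milk on one visit, coffee on a later visit
print(sequence_support(sequences, pattern))
```

A full miner would enumerate candidate patterns and keep those whose support meets the user-defined minimum; this sketch covers only the counting step.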

Page 36: Cluster2

Discovery of patterns in time series

A time series is a sequence of events, where each event is a fixed type of transaction.

Examples of patterns to look for in a stock's price history:
• The period during which the stock rose steadily for n days.
• The longest period over which the stock had a change of not more than 1% over the last closing price.
• The quarter of a year during which the stock had the most percentage gain or loss.
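As an illustration of the steady-rise pattern, the sketch below finds the longest run of consecutive daily increases in a list of closing prices. The prices are made up and `longest_rising_run` is a hypothetical helper name.

```python
def longest_rising_run(prices):
    """Return (start_index, days) of the longest run of consecutive rises."""
    best_start, best_len = 0, 0
    start = 0
    for i in range(1, len(prices)):
        if prices[i] <= prices[i - 1]:
            start = i                # the run is broken; restart from today
        length = i - start           # number of rising days ending at day i
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len


closing = [10, 11, 12, 11, 12, 13, 14, 13]  # hypothetical closing prices
print(longest_rising_run(closing))  # (3, 3)
```

Here the longest rise starts at index 3 and lasts three days (11 to 12 to 13 to 14). A query for "rose steadily for n days" then reduces to checking whether the returned length is at least n.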

Page 37: Cluster2

Discovery of classification rules

Classification is a process of defining a function that classifies a given object into one of many possible classes.

Page 38: Cluster2

Example

A bank wishes to classify its loan applicants into two groups or classes:
• A group who are loan worthy (eligible)
• Another group who are not worthy (not eligible)

To do the above classification, the bank can use the classification rule given below:

If monthly income is greater than 30,000, then they are worthy
Else not worthy
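The bank's rule translates directly into a one-line function. The function name `loan_class` and the sample incomes are illustrative.

```python
def loan_class(monthly_income):
    """Apply the bank's single rule: income above 30,000 means loan worthy."""
    return "worthy" if monthly_income > 30_000 else "not worthy"


for income in (45_000, 30_000, 12_000):  # hypothetical applicants
    print(income, loan_class(income))
```

An income of exactly 30,000 falls in the "not worthy" class because the rule asks for strictly greater than 30,000.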

Page 39: Cluster2

Regression

Regression is defined as a function over variables that yields the target variable.
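One common concrete choice of such a function, not stated on the slide and used here purely as an assumption, is a straight line fitted by least squares. The sketch handles a single input variable; the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical input variable
ys = [1.0, 3.0, 5.0, 7.0]   # hypothetical target values
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Once fitted, the target for a new input x is predicted as `slope * x + intercept`.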

Page 40: Cluster2

Example

Labtest(Patient id, test1, test2, …, testn)

This contains the values of n tests for one patient.

The target variable that we wish to predict is p, the probability of survival of the patient.

p = f(test1, test2, test3, …, testn)

This function is called the regression function.

Page 41: Cluster2

MSPVL Polytechnic college
