
Page 1:

Data Mining in Large Databases

(Contributing slides by Gregory Piatetsky-Shapiro, and by Rajeev Rastogi and Kyuseok Shim, Lucent Bell Laboratories)

Page 2:

Overview

Introduction
Association Rules
Classification
Clustering

Page 3:

Background

Corporations have huge databases containing a wealth of information

Business databases potentially constitute a goldmine of valuable business information

Very little functionality in database systems to support data mining applications

Data mining: The efficient discovery of previously unknown patterns in large databases

Page 4:

Applications

Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce
Decision Support
Web Search

Page 5:

Data Mining Techniques

Association Rules
Sequential Patterns
Classification
Clustering
Similar Time Sequences
Similar Images
Outlier Discovery
Text/Web Mining

Page 6:

Examples of Patterns

Association rules
98% of people who purchase diapers buy beer

Classification
People with age less than 25 and salary > 40k drive sports cars

Similar time sequences
Stocks of companies A and B perform similarly

Outlier detection
Residential customers with businesses at home

Page 7:

Association Rules

Given: A database of customer transactions
Each transaction is a set of items

Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
Any number of items in the consequent or antecedent of a rule
Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

Page 8:

Association Rules

Sample applications:
Market basket analysis
Attached mailing in direct marketing
Fraud detection for medical insurance
Department store floor/shelf planning

Page 9:

Confidence and Support

A rule must have some minimum user-specified confidence.
1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of those cases the customer also bought 3.

A rule must have some minimum user-specified support.
1 & 2 => 3 should hold in some minimum percentage of transactions to have business value.

Page 10:

Example

Transaction Id   Purchased Items
1                {1, 2, 3}
2                {1, 4}
3                {1, 3}
4                {2, 5, 6}

For minimum support = 50% and minimum confidence = 50%, we have the following rules:
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
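As a quick sanity check on these numbers, here is a minimal Python sketch (not part of the original slides) that recomputes the support and confidence of 1 => 3 and 3 => 1 from the four transactions above; the helper name `support` is purely illustrative.

```python
# Transactions from the example above.
transactions = [{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_13 = support({1, 3}, transactions)            # 0.50 -> 50% support
conf_1_3 = sup_13 / support({1}, transactions)    # 0.50 / 0.75 = 0.66...
conf_3_1 = sup_13 / support({3}, transactions)    # 0.50 / 0.50 = 1.00
print(sup_13, conf_1_3, conf_3_1)
```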

Page 11:

Problem Decomposition

1. Find all sets of items that have minimum support
   Use the Apriori algorithm

2. Use the frequent itemsets to generate the desired rules
   Generation is straightforward (see the sketch below)
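As a rough illustration of step 2, the sketch below (not from the slides) generates rules from a table of itemset supports by trying every non-empty proper subset of each frequent itemset as an antecedent; the function and variable names are hypothetical. Because every subset of a frequent itemset is itself frequent, every antecedent is guaranteed to appear in the support table.

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """supports: dict mapping frozenset -> support. Returns (antecedent, consequent, confidence)."""
    rules = []
    for itemset in supports:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supports[itemset] / supports[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Supports taken from the earlier example (minimum confidence = 50%).
supports = {frozenset({1}): 0.75, frozenset({3}): 0.50, frozenset({1, 3}): 0.50}
print(generate_rules(supports, min_conf=0.5))   # 1 => 3 (conf 0.66), 3 => 1 (conf 1.0)
```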

Page 12:

Problem Decomposition - Example

TID   Items
1     {1, 2, 3}
2     {1, 3}
3     {1, 4}
4     {2, 5, 6}

Frequent Itemset   Support
{1}                75%
{2}                50%
{3}                50%
{1, 3}             50%

For minimum support = 50% and minimum confidence = 50%

For the rule 1 => 3:
Support = Support({1, 3}) = 50%
Confidence = Support({1, 3}) / Support({1}) = 66%

Page 13:

The Apriori Algorithm

Fk : set of frequent itemsets of size k
Ck : set of candidate itemsets of size k

F1 = {large items}
for (k = 1; Fk != ∅; k++) do {
    Ck+1 = new candidates generated from Fk
    foreach transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = candidates in Ck+1 with minimum support
}
Answer = Uk Fk
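The pseudocode above translates fairly directly into Python. The sketch below is an unoptimized rendering under my own assumptions (candidate generation by pairwise unions, plus the subset-pruning check described on the next slide); it is meant only to make the loop concrete, not to be a definitive implementation.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    # F1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    F = {s for s, c in counts.items() if c >= min_sup_count}
    answer, k = set(F), 1
    while F:
        # Generate C(k+1) by joining Fk with itself, then prune candidates
        # that have an infrequent k-subset (the key observation on the next slide).
        C = {a | b for a in F for b in F if len(a | b) == k + 1}
        C = {c for c in C if all(frozenset(s) in F for s in combinations(c, k))}
        # One pass over the database to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in C}
        F = {c for c, n in counts.items() if n >= min_sup_count}
        answer |= F
        k += 1
    return answer

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_sup_count=2))   # illustrative threshold
```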

Page 14:

Key Observation

Every subset of a frequent itemset is also frequent

=> a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk

Page 15:

Apriori - Example

Database D (TID : Items):
1 : {1, 3, 4}
2 : {2, 3, 5}
3 : {1, 2, 3, 5}
4 : {2, 5}

Scan D -> C1 (Itemset : Sup.):
{1} : 2   {2} : 3   {3} : 3   {4} : 1   {5} : 3

F1 (Itemset : Sup.):
{2} : 3   {3} : 3   {5} : 3

C2 (Itemset):
{2, 3}   {2, 5}   {3, 5}

Scan D -> C2 (Itemset : Sup.):
{2, 3} : 2   {2, 5} : 3   {3, 5} : 2

F2 (Itemset : Sup.):
{2, 5} : 3

Page 16:

Sequential Patterns

Given: A sequence of customer transactions
Each transaction is a set of items

Find all maximal sequential patterns supported by more than a user-specified percentage of customers

Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

Page 17:

Classification

Given: A database of tuples, each assigned a class label

Develop a model/profile for each class
Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)

Sample applications:
Credit card approval (good, bad)
Bank locations (good, fair, poor)
Treatment effectiveness (good, fair, poor)

Page 18:

Decision Tree

An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
A new case is classified by following a matching path to a leaf node.

Page 19:

Decision Trees

Outlook    Temperature   Humidity   Windy   Play?
sunny      hot           high       false   No
sunny      hot           high       true    No
overcast   hot           high       false   Yes
rain       mild          high       false   Yes
rain       cool          normal     false   Yes
rain       cool          normal     true    No
overcast   cool          normal     true    Yes
sunny      mild          high       false   No
sunny      cool          normal     false   Yes
rain       mild          normal     false   Yes
sunny      mild          normal     true    Yes
overcast   mild          high       true    Yes
overcast   hot           normal     false   Yes
rain       mild          high       true    No
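For readers who want to experiment, here is a hedged sketch that fits a decision tree to the weather table above, assuming scikit-learn and pandas are available (neither is mentioned in the slides). Note that scikit-learn builds binary trees over one-hot encoded attributes, so the result is equivalent in spirit, not identical, to the multiway tree on the next slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rows = [
    ("sunny", "hot", "high", False, "No"),       ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),   ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"), ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),   ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"), ("rain", "mild", "high", True, "No"),
]
df = pd.DataFrame(rows, columns=["Outlook", "Temperature", "Humidity", "Windy", "Play"])

X = pd.get_dummies(df.drop(columns="Play"))        # one-hot encode the categorical attributes
clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based (information-gain style) splits
clf.fit(X, df["Play"])
print(clf.predict(X.iloc[:1]))                     # classify the first training example -> ['No']
```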

Page 20:

Example Tree

Outlook = sunny    -> Humidity = high   -> No
                      Humidity = normal -> Yes
Outlook = overcast -> Yes
Outlook = rain     -> Windy = true  -> No
                      Windy = false -> Yes

Page 21:

Decision Tree Algorithms

Building phase
Recursively split nodes using the best splitting attribute for each node

Pruning phase
A smaller, imperfect decision tree generally achieves better accuracy on unseen data
Prune leaf nodes recursively to prevent over-fitting

Page 22:

Attribute Selection

Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes

Popular impurity criterion: information gain
Information gain increases with the average purity of the subsets that an attribute produces
Strategy: choose the attribute that results in the greatest information gain

Page 23:

Which attribute to select?

Page 24:

Computing information

Information is measured in bits
Given a probability distribution, the info required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)

Formula for computing the entropy:

entropy(p_1, p_2, ..., p_n) = -p_1 log p_1 - p_2 log p_2 - ... - p_n log p_n
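A tiny Python helper (mine, not the slides') makes the formula concrete; logarithms are taken base 2 so that the result is in bits, and the p log p term is treated as 0 when p = 0.

```python
from math import log2

def entropy(*probs):
    """Entropy of a discrete distribution, in bits; 0 * log(0) is treated as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(9/14, 5/14))   # ~0.940 bits: class distribution of the weather data (9 Yes, 5 No)
print(entropy(2/5, 3/5))     # ~0.971 bits: used on the next slide for "Outlook" = "Sunny"
print(entropy(1.0, 0.0))     # 0 bits: a pure node
```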

Page 25:

Example: attribute "Outlook"

"Outlook" = "Sunny":
info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

"Outlook" = "Overcast":
info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits

"Outlook" = "Rainy":
info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

Page 26:

Computing the information gain

Information gain = (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for the attributes from the weather data:
gain("Outlook")     = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity")    = 0.152 bits
gain("Windy")       = 0.048 bits
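The arithmetic for gain("Outlook") can be reproduced with a few lines of Python (a worked check, not slide code); `expected_info` is an assumed helper name for the weighted average of subset entropies.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(splits):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(s) for s in splits)
    return sum(sum(s) / total * entropy(s) for s in splits)

before = entropy([9, 5])                          # info([9,5])            = 0.940 bits
after = expected_info([[2, 3], [4, 0], [3, 2]])   # Sunny, Overcast, Rainy = 0.693 bits
print(round(before - after, 3))                   # 0.247 bits = gain("Outlook")
```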

Page 27:

Continuing to split

(splitting the "Outlook" = "Sunny" branch)

gain("Temperature") = 0.571 bits
gain("Humidity")    = 0.971 bits
gain("Windy")       = 0.020 bits

Page 28:

The final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when the data can't be split any further

Page 29:

Decision Trees

Pros
Fast execution time
Generated rules are easy to interpret by humans
Scale well for large data sets
Can handle high-dimensional data

Cons
Cannot capture correlations among attributes
Consider only axis-parallel cuts

Page 30:

Clustering

Given: Data points and the number of desired clusters K

Group the data points into K clusters
Data points within clusters are more similar than across clusters

Sample applications:
Customer segmentation
Market basket customer analysis
Attached mailing in direct marketing
Clustering companies with similar growth

Page 31:

Traditional Algorithms

Partitional algorithms
Enumerate K partitions optimizing some criterion

Example: square-error criterion

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p - m_i|²

where m_i is the mean of cluster C_i

Page 32:

K-means Algorithm

Assign initial means
Assign each point to the cluster with the closest mean
Compute the new mean for each cluster
Iterate until the criterion function converges
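A compact, self-contained sketch of this loop (assuming plain numeric points, random initial means, and the square-error criterion from the earlier slide) might look as follows; it is an illustration, not a production implementation.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, max_iters=100):
    means = random.sample(points, k)                     # assign initial means
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # assign each point to the closest mean
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                           # converged: means no longer move
            break
        means = new_means                                # compute the new mean for each cluster
    return means, clusters

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 9.5)]
print(kmeans(points, k=2))
```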

Page 33:

K-means example, step 1

Pick 3 initial cluster centers (randomly)
(figure: data points in the X-Y plane with centers k1, k2, k3)

Page 34:

K-means example, step 2

Assign each point to the closest cluster center
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 35:

K-means example, step 3

Move each cluster center to the mean of its cluster
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 36:

K-means example, step 4

Reassign points closest to a different new cluster center
Q: Which points are reassigned?
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 37:

K-means example, step 4 ...

A: three points are reassigned
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 38:

K-means example, step 4b

Re-compute the cluster means
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 39:

K-means example, step 5

Move the cluster centers to the cluster means
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 40:

Discussion

Results can vary significantly depending on the initial choice of seeds
Can get trapped in a local minimum
Example: (figure: instances and initial cluster centers)

To increase the chance of finding the global optimum: restart with different random seeds

Page 41:

K-means clustering summary

Advantages
Simple, understandable
Items automatically assigned to clusters

Disadvantages
Must pick the number of clusters beforehand
All items are forced into a cluster
Too sensitive to outliers

Page 42:

Traditional Algorithms

Hierarchical clustering
Nested partitions
Tree structure

Page 43:

Agglomerative Hierarchical Algorithms

The most widely used hierarchical clustering algorithms

Initially each point is a distinct cluster
Repeatedly merge the closest clusters until the number of clusters becomes K

Closest:
d_mean(C_i, C_j) = |m_i - m_j|
d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} |p - q|
Likewise d_ave(C_i, C_j) and d_max(C_i, C_j)
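To make the merge loop concrete, here is a small single-linkage (d_min) sketch in plain Python; the names and the quadratic pair search are illustrative only, and real implementations use priority queues or library routines.

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def d_min(ci, cj):
    """Single-linkage distance: closest pair of points across the two clusters."""
    return min(dist(p, q) for p in ci for q in cj)

def agglomerative(points, k):
    clusters = [[p] for p in points]          # initially each point is a distinct cluster
    while len(clusters) > k:                  # repeatedly merge the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: d_min(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], k=2))
```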

Page 44:

Similar Time Sequences

Given: A set of time-series sequences

Find:
All sequences similar to the query sequence
All pairs of similar sequences
(whole matching vs. subsequence matching)

Sample applications:
Financial markets
Scientific databases
Medical diagnosis

Page 45:

Whole Sequence Matching

Basic idea:
Extract k features from every sequence
Every sequence is then represented as a point in k-dimensional space
Use a multi-dimensional index to store and search these points
Spatial indices do not work well for high-dimensional data

Page 46:

Similar Time Sequences

Take Euclidean distance as the similarity measure
Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
Build a multi-dimensional index using the first few Fourier coefficients
Use the index to retrieve sequences that are at most a given distance away from the query sequence
Post-processing: compute the actual distance between sequences in the time domain
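The sketch below illustrates this filter-and-refine idea with numpy (an assumption; the slides do not name a library) and a brute-force scan standing in for the multi-dimensional index. Using the orthonormal DFT keeps the feature-space distance a lower bound on the true Euclidean distance, so the filter step produces no false dismissals; the names `features` and `search` are hypothetical.

```python
import numpy as np

def features(seq, k=4):
    # First k coefficients of the orthonormal DFT (Parseval: energy is preserved).
    return np.fft.fft(np.asarray(seq, dtype=float), norm="ortho")[:k]

def search(database, query, eps, k=4):
    q = features(query, k)
    # Filter step: an index lookup in a real system, a linear scan here.
    candidates = [s for s in database if np.linalg.norm(features(s, k) - q) <= eps]
    # Post-processing: check the actual Euclidean distance in the time domain.
    return [s for s in candidates
            if np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(query, dtype=float)) <= eps]

db = [np.sin(np.linspace(0, 2 * np.pi, 64)) + 0.1 * i for i in range(5)]
print(len(search(db, db[0], eps=1.0)))   # db[0] and db[1] fall within distance 1.0
```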

Page 47:

Outlier Discovery

Given: Data points and the number of outliers (n) to find

Find the top n outlier points
Outliers are considerably dissimilar from the remainder of the data

Sample applications:
Credit card fraud detection
Telecom fraud detection
Medical analysis

Page 48:

Statistical Approaches

Model the underlying distribution that generates the dataset (e.g., normal distribution)

Use discordancy tests, which depend on:
the data distribution
distribution parameters (e.g., mean, variance)
the number of expected outliers

Drawbacks:
Most tests are for a single attribute
In many cases, the data distribution may not be known
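A minimal example of the statistical idea, under the (often unrealistic) assumption of a single numeric attribute drawn from a normal distribution: flag values whose z-score exceeds a threshold. The function name and threshold are illustrative.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]          # 95 is the obvious outlier
print(zscore_outliers(data, threshold=2.0))     # -> [95]
```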

Page 49:

Distance-based Outliers

For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
General enough to model statistical outlier tests
Nested-loop and cell-based algorithms have been developed
Scale okay for large datasets
The cell-based algorithm does not scale well to high dimensions
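A naive nested-loop version of this definition is easy to write down; the sketch below (O(n²), with illustrative names only) flags a point as an outlier when at least a fraction p of the other points lie farther than d from it.

```python
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def db_outliers(points, p, d):
    outliers = []
    for i, o in enumerate(points):
        far = sum(distance(o, q) > d for j, q in enumerate(points) if j != i)
        if far >= p * (len(points) - 1):    # at least a fraction p of the other points are far away
            outliers.append(o)
    return outliers

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, d=3.0))   # -> [(10, 10)]
```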