
Page 1:

Data Mining in Large Databases

(Contributing slides by Gregory Piatetsky-Shapiro, and by Rajeev Rastogi and Kyuseok Shim, Lucent Bell Laboratories)

Page 2:

Overview

Introduction
Association Rules
Classification
Clustering

Page 3:

Background

Corporations have huge databases containing a wealth of information

Business databases potentially constitute a goldmine of valuable business information

Very little functionality in database systems to support data mining applications

Data mining: The efficient discovery of previously unknown patterns in large databases

Page 4:

Applications

Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce
Decision Support
Web Search

Page 5:

Data Mining Techniques

Association Rules
Sequential Patterns
Classification
Clustering
Similar Time Sequences
Similar Images
Outlier Discovery
Text/Web Mining

Page 6:

Examples of Patterns

Association rules
98% of people who purchase diapers buy beer

Classification
People with age less than 25 and salary > 40k drive sports cars

Similar time sequences
Stocks of companies A and B perform similarly

Outlier detection
Residential customers with businesses at home

Page 7:

Association Rules

Given: A database of customer transactions
Each transaction is a set of items

Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
Any number of items in the consequent or antecedent of a rule
Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)

Page 8:

Association Rules

Sample applications:
Market basket analysis
Attached mailing in direct marketing
Fraud detection for medical insurance
Department store floor/shelf planning

Page 9:

Confidence and Support

A rule must have some minimum user-specified confidence.
1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of those cases the customer also bought 3.

A rule must have some minimum user-specified support.
1 & 2 => 3 should hold in some minimum percentage of transactions to have business value.

Page 10:

Example

Transaction Id   Purchased Items
1                {1, 2, 3}
2                {1, 4}
3                {1, 3}
4                {2, 5, 6}

For minimum support = 50% and minimum confidence = 50%, we have the following rules:
1 => 3 with 50% support and 66% confidence
3 => 1 with 50% support and 100% confidence
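As a quick sanity check on these numbers, here is a minimal Python sketch (not part of the original slides) that recomputes the support and confidence of 1 => 3 and 3 => 1 from the four transactions above; the helper name `support` is purely illustrative.

```python
# Transactions from the example above.
transactions = [{1, 2, 3}, {1, 4}, {1, 3}, {2, 5, 6}]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_13 = support({1, 3}, transactions)            # 0.50 -> 50% support
conf_1_3 = sup_13 / support({1}, transactions)    # 0.50 / 0.75 = 0.66...
conf_3_1 = sup_13 / support({3}, transactions)    # 0.50 / 0.50 = 1.00
print(sup_13, conf_1_3, conf_3_1)
```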

Page 11:

Problem Decomposition

1. Find all sets of items that have minimum support
   Use the Apriori algorithm

2. Use the frequent itemsets to generate the desired rules
   Generation is straightforward (see the sketch below)
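As a rough illustration of step 2, the sketch below (not from the slides) generates rules from a table of itemset supports by trying every non-empty proper subset of each frequent itemset as an antecedent; the function and variable names are hypothetical. Because every subset of a frequent itemset is itself frequent, every antecedent is guaranteed to appear in the support table.

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """supports: dict mapping frozenset -> support. Returns (antecedent, consequent, confidence)."""
    rules = []
    for itemset in supports:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supports[itemset] / supports[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Supports taken from the earlier example (minimum confidence = 50%).
supports = {frozenset({1}): 0.75, frozenset({3}): 0.50, frozenset({1, 3}): 0.50}
print(generate_rules(supports, min_conf=0.5))   # 1 => 3 (conf 0.66), 3 => 1 (conf 1.0)
```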

Page 12:

Problem Decomposition - Example

TID   Items
1     {1, 2, 3}
2     {1, 3}
3     {1, 4}
4     {2, 5, 6}

Frequent Itemset   Support
{1}                75%
{2}                50%
{3}                50%
{1, 3}             50%

For minimum support = 50% and minimum confidence = 50%

For the rule 1 => 3:
Support = Support({1, 3}) = 50%
Confidence = Support({1, 3}) / Support({1}) = 66%

Page 13:

The Apriori Algorithm

Fk : set of frequent itemsets of size k
Ck : set of candidate itemsets of size k

F1 = {large items}
for (k = 1; Fk != ∅; k++) do {
    Ck+1 = new candidates generated from Fk
    foreach transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = candidates in Ck+1 with minimum support
}
Answer = Uk Fk
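The pseudocode above translates fairly directly into Python. The sketch below is an unoptimized rendering under my own assumptions (candidate generation by pairwise unions, plus the subset-pruning check described on the next slide); it is meant only to make the loop concrete, not to be a definitive implementation.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    # F1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    F = {s for s, c in counts.items() if c >= min_sup_count}
    answer, k = set(F), 1
    while F:
        # Generate C(k+1) by joining Fk with itself, then prune candidates
        # that have an infrequent k-subset (the key observation on the next slide).
        C = {a | b for a in F for b in F if len(a | b) == k + 1}
        C = {c for c in C if all(frozenset(s) in F for s in combinations(c, k))}
        # One pass over the database to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in C}
        F = {c for c, n in counts.items() if n >= min_sup_count}
        answer |= F
        k += 1
    return answer

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_sup_count=2))   # illustrative threshold
```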

Page 14:

Key Observation

Every subset of a frequent itemset is also frequent

=> a candidate itemset in Ck+1 can be pruned if even one of its subsets is not contained in Fk

Page 15:

Apriori - Example

Database D (TID : Items):
1 : {1, 3, 4}
2 : {2, 3, 5}
3 : {1, 2, 3, 5}
4 : {2, 5}

Scan D -> C1 (Itemset : Sup.):
{1} : 2   {2} : 3   {3} : 3   {4} : 1   {5} : 3

F1 (Itemset : Sup.):
{2} : 3   {3} : 3   {5} : 3

C2 (Itemset):
{2, 3}   {2, 5}   {3, 5}

Scan D -> C2 (Itemset : Sup.):
{2, 3} : 2   {2, 5} : 3   {3, 5} : 2

F2 (Itemset : Sup.):
{2, 5} : 3

Page 16:

Sequential Patterns

Given: A sequence of customer transactions
Each transaction is a set of items

Find all maximal sequential patterns supported by more than a user-specified percentage of customers

Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction

Page 17:

Classification

Given: A database of tuples, each assigned a class label

Develop a model/profile for each class
Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)

Sample applications:
Credit card approval (good, bad)
Bank locations (good, fair, poor)
Treatment effectiveness (good, fair, poor)

Page 18:

Decision Tree

An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or class label distribution.
At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
A new case is classified by following a matching path to a leaf node.

Page 19:

Decision Trees

Outlook    Temperature   Humidity   Windy   Play?
sunny      hot           high       false   No
sunny      hot           high       true    No
overcast   hot           high       false   Yes
rain       mild          high       false   Yes
rain       cool          normal     false   Yes
rain       cool          normal     true    No
overcast   cool          normal     true    Yes
sunny      mild          high       false   No
sunny      cool          normal     false   Yes
rain       mild          normal     false   Yes
sunny      mild          normal     true    Yes
overcast   mild          high       true    Yes
overcast   hot           normal     false   Yes
rain       mild          high       true    No
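For readers who want to experiment, here is a hedged sketch that fits a decision tree to the weather table above, assuming scikit-learn and pandas are available (neither is mentioned in the slides). Note that scikit-learn builds binary trees over one-hot encoded attributes, so the result is equivalent in spirit, not identical, to the multiway tree on the next slide.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rows = [
    ("sunny", "hot", "high", False, "No"),       ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),   ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"), ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),   ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"), ("rain", "mild", "high", True, "No"),
]
df = pd.DataFrame(rows, columns=["Outlook", "Temperature", "Humidity", "Windy", "Play"])

X = pd.get_dummies(df.drop(columns="Play"))        # one-hot encode the categorical attributes
clf = DecisionTreeClassifier(criterion="entropy")  # entropy-based (information-gain style) splits
clf.fit(X, df["Play"])
print(clf.predict(X.iloc[:1]))                     # classify the first training example -> ['No']
```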

Page 20:

Example Tree

Outlook = sunny    -> Humidity = high   -> No
                      Humidity = normal -> Yes
Outlook = overcast -> Yes
Outlook = rain     -> Windy = true  -> No
                      Windy = false -> Yes

Page 21:

Decision Tree Algorithms

Building phase
Recursively split nodes using the best splitting attribute for each node

Pruning phase
A smaller, imperfect decision tree generally achieves better accuracy on unseen data
Prune leaf nodes recursively to prevent over-fitting

Page 22:

Attribute Selection

Which is the best attribute?
The one which will result in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes

Popular impurity criterion: information gain
Information gain increases with the average purity of the subsets that an attribute produces
Strategy: choose the attribute that results in the greatest information gain

Page 23:

Which attribute to select?

Page 24:

Computing information

Information is measured in bits
Given a probability distribution, the info required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)

Formula for computing the entropy:

entropy(p_1, p_2, ..., p_n) = -p_1 log p_1 - p_2 log p_2 - ... - p_n log p_n
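A tiny Python helper (mine, not the slides') makes the formula concrete; logarithms are taken base 2 so that the result is in bits, and the p log p term is treated as 0 when p = 0.

```python
from math import log2

def entropy(*probs):
    """Entropy of a discrete distribution, in bits; 0 * log(0) is treated as 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy(9/14, 5/14))   # ~0.940 bits: class distribution of the weather data (9 Yes, 5 No)
print(entropy(2/5, 3/5))     # ~0.971 bits: used on the next slide for "Outlook" = "Sunny"
print(entropy(1.0, 0.0))     # 0 bits: a pure node
```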

Page 25:

Example: attribute "Outlook"

"Outlook" = "Sunny":
info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

"Outlook" = "Overcast":
info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits

"Outlook" = "Rainy":
info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

Expected information for the attribute:
info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

Page 26:

Computing the information gain

Information gain = (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

Information gain for the attributes from the weather data:
gain("Outlook")     = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity")    = 0.152 bits
gain("Windy")       = 0.048 bits
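The arithmetic for gain("Outlook") can be reproduced with a few lines of Python (a worked check, not slide code); `expected_info` is an assumed helper name for the weighted average of subset entropies.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(splits):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(s) for s in splits)
    return sum(sum(s) / total * entropy(s) for s in splits)

before = entropy([9, 5])                          # info([9,5])            = 0.940 bits
after = expected_info([[2, 3], [4, 0], [3, 2]])   # Sunny, Overcast, Rainy = 0.693 bits
print(round(before - after, 3))                   # 0.247 bits = gain("Outlook")
```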

Page 27:

Continuing to split

(splitting the "Outlook" = "Sunny" branch)

gain("Temperature") = 0.571 bits
gain("Humidity")    = 0.971 bits
gain("Windy")       = 0.020 bits

Page 28:

The final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when the data can't be split any further

Page 29:

Decision Trees

Pros
Fast execution time
Generated rules are easy to interpret by humans
Scale well for large data sets
Can handle high-dimensional data

Cons
Cannot capture correlations among attributes
Consider only axis-parallel cuts

Page 30:

Clustering

Given: Data points and the number of desired clusters K

Group the data points into K clusters
Data points within clusters are more similar than across clusters

Sample applications:
Customer segmentation
Market basket customer analysis
Attached mailing in direct marketing
Clustering companies with similar growth

Page 31:

Traditional Algorithms

Partitional algorithms
Enumerate K partitions optimizing some criterion

Example: square-error criterion

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} |p - m_i|²

where m_i is the mean of cluster C_i

Page 32:

K-means Algorithm

Assign initial means
Assign each point to the cluster with the closest mean
Compute the new mean for each cluster
Iterate until the criterion function converges
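A compact, self-contained sketch of this loop (assuming plain numeric points, random initial means, and the square-error criterion from the earlier slide) might look as follows; it is an illustration, not a production implementation.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k, max_iters=100):
    means = random.sample(points, k)                     # assign initial means
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # assign each point to the closest mean
            clusters[min(range(k), key=lambda i: dist2(p, means[i]))].append(p)
        new_means = [centroid(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:                           # converged: means no longer move
            break
        means = new_means                                # compute the new mean for each cluster
    return means, clusters

points = [(1.0, 1.0), (1.5, 2.0), (0.5, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 9.5)]
print(kmeans(points, k=2))
```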

Page 33:

K-means example, step 1

Pick 3 initial cluster centers (randomly)
(figure: data points in the X-Y plane with centers k1, k2, k3)

Page 34:

K-means example, step 2

Assign each point to the closest cluster center
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 35:

K-means example, step 3

Move each cluster center to the mean of its cluster
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 36:

K-means example, step 4

Reassign points closest to a different new cluster center
Q: Which points are reassigned?
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 37:

K-means example, step 4 ...

A: three points are reassigned
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 38:

K-means example, step 4b

Re-compute the cluster means
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 39:

K-means example, step 5

Move the cluster centers to the cluster means
(figure: X-Y scatter plot with centers k1, k2, k3)

Page 40:

Discussion

Results can vary significantly depending on the initial choice of seeds
Can get trapped in a local minimum
Example: (figure: instances and initial cluster centers)

To increase the chance of finding the global optimum: restart with different random seeds

Page 41:

K-means clustering summary

Advantages
Simple, understandable
Items automatically assigned to clusters

Disadvantages
Must pick the number of clusters beforehand
All items are forced into a cluster
Too sensitive to outliers

Page 42:

Traditional Algorithms

Hierarchical clustering
Nested partitions
Tree structure

Page 43:

Agglomerative Hierarchical Algorithms

The most widely used hierarchical clustering algorithms

Initially each point is a distinct cluster
Repeatedly merge the closest clusters until the number of clusters becomes K

Closest:
d_mean(C_i, C_j) = |m_i - m_j|
d_min(C_i, C_j) = min_{p ∈ C_i, q ∈ C_j} |p - q|
Likewise d_ave(C_i, C_j) and d_max(C_i, C_j)
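To make the merge loop concrete, here is a small single-linkage (d_min) sketch in plain Python; the names and the quadratic pair search are illustrative only, and real implementations use priority queues or library routines.

```python
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def d_min(ci, cj):
    """Single-linkage distance: closest pair of points across the two clusters."""
    return min(dist(p, q) for p in ci for q in cj)

def agglomerative(points, k):
    clusters = [[p] for p in points]          # initially each point is a distinct cluster
    while len(clusters) > k:                  # repeatedly merge the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: d_min(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], k=2))
```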

Page 44:

Similar Time Sequences

Given: A set of time-series sequences

Find:
All sequences similar to the query sequence
All pairs of similar sequences
(whole matching vs. subsequence matching)

Sample applications:
Financial markets
Scientific databases
Medical diagnosis

Page 45:

Whole Sequence Matching

Basic idea:
Extract k features from every sequence
Every sequence is then represented as a point in k-dimensional space
Use a multi-dimensional index to store and search these points
Spatial indices do not work well for high-dimensional data

Page 46:

Similar Time Sequences

Take Euclidean distance as the similarity measure
Obtain the Discrete Fourier Transform (DFT) coefficients of each sequence in the database
Build a multi-dimensional index using the first few Fourier coefficients
Use the index to retrieve sequences that are at most a given distance away from the query sequence
Post-processing: compute the actual distance between sequences in the time domain
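The sketch below illustrates this filter-and-refine idea with numpy (an assumption; the slides do not name a library) and a brute-force scan standing in for the multi-dimensional index. Using the orthonormal DFT keeps the feature-space distance a lower bound on the true Euclidean distance, so the filter step produces no false dismissals; the names `features` and `search` are hypothetical.

```python
import numpy as np

def features(seq, k=4):
    # First k coefficients of the orthonormal DFT (Parseval: energy is preserved).
    return np.fft.fft(np.asarray(seq, dtype=float), norm="ortho")[:k]

def search(database, query, eps, k=4):
    q = features(query, k)
    # Filter step: an index lookup in a real system, a linear scan here.
    candidates = [s for s in database if np.linalg.norm(features(s, k) - q) <= eps]
    # Post-processing: check the actual Euclidean distance in the time domain.
    return [s for s in candidates
            if np.linalg.norm(np.asarray(s, dtype=float) - np.asarray(query, dtype=float)) <= eps]

db = [np.sin(np.linspace(0, 2 * np.pi, 64)) + 0.1 * i for i in range(5)]
print(len(search(db, db[0], eps=1.0)))   # db[0] and db[1] fall within distance 1.0
```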

Page 47:

Outlier Discovery

Given: Data points and the number of outliers (n) to find

Find the top n outlier points
Outliers are considerably dissimilar from the remainder of the data

Sample applications:
Credit card fraud detection
Telecom fraud detection
Medical analysis

Page 48:

Statistical Approaches

Model the underlying distribution that generates the dataset (e.g., normal distribution)

Use discordancy tests, which depend on:
the data distribution
distribution parameters (e.g., mean, variance)
the number of expected outliers

Drawbacks:
Most tests are for a single attribute
In many cases, the data distribution may not be known
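A minimal example of the statistical idea, under the (often unrealistic) assumption of a single numeric attribute drawn from a normal distribution: flag values whose z-score exceeds a threshold. The function name and threshold are illustrative.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]          # 95 is the obvious outlier
print(zscore_outliers(data, threshold=2.0))     # -> [95]
```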

Page 49:

Distance-based Outliers

For a fraction p and a distance d, a point o is an outlier if at least a fraction p of the points lie at a distance greater than d from o
General enough to model statistical outlier tests
Nested-loop and cell-based algorithms have been developed
Scale okay for large datasets
The cell-based algorithm does not scale well to high dimensions
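A naive nested-loop version of this definition is easy to write down; the sketch below (O(n²), with illustrative names only) flags a point as an outlier when at least a fraction p of the other points lie farther than d from it.

```python
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def db_outliers(points, p, d):
    outliers = []
    for i, o in enumerate(points):
        far = sum(distance(o, q) > d for j, q in enumerate(points) if j != i)
        if far >= p * (len(points) - 1):    # at least a fraction p of the other points are far away
            outliers.append(o)
    return outliers

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.9, d=3.0))   # -> [(10, 10)]
```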