data warehousing and data mining

62
A.A. 04-05 Datawarehousing & Datamining 1 Data Warehousing and Data Mining

Upload: tommy96

Post on 13-Jan-2015

1.010 views

Category:

Documents


4 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 1

Data Warehousing

and

Data Mining

Page 2: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 2

Outline

1. Introduction and Terminology

2. Data Warehousing3. Data Mining

• Association rules• Sequential patterns• Classification• Clustering

Page 3: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 3

Introduction and Terminology

Evolution of database technologyFile processing (60s)

Relational DBMS (70s)

Advanced data models e.g.,Object-Oriented,

application-oriented (80s)Web-Based Repositories (90s)

Data Warehousing andData Mining (90s)

Global/Integrated Information Systems (2000s)

Page 4: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 4

Introduction and Terminology

Major types of information systems within an organization

TRANSACTIONPROCESSING

SYSTEMSEnterprise Resource Planning (ERP)Customer Relationship Management (CRM)

KNOWLEDGELEVEL

SYSTEMSOffice automation, Groupware, ContentDistribution systems, Workflowmangement systems

MANAGEMENTLEVEL

SYSTEMSDecision Support Systems (DSS)Management Information Sys. (MIS)

EXECUTIVESUPPORTSYSTEMS

Fewsophisticated

users

Many un-sophisticated

users

Page 5: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 5

Introduction and Terminology

Transaction processing systems:• Support the operational level of the organization, possibly

integrating needs of different functional areas (ERP);• Perform and record the daily transactions necessary to the

conduct of the business• Execute simple read/update operations on traditional

databases, aiming at maximizing transaction throughput• Their activity is described as:

OLTP (On-Line Transaction Processing)

Page 6: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 6

Introduction and Terminology

Knowledge level systems: provide digital support for managingdocuments (office automation), user cooperation andcommunication (groupware), storing and retrievinginformation (content distribution), automation of businessprocedures (workflow management)

Management level systems: support planning, controlling andsemi-structured decision making at management level byproviding reports and analyses of current and historical data

Executive support systems: support unstructured decisionmaking at the strategic level of the organization

Page 7: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 7

Introduction and Terminology

DecisionMaking

Data PresentationReporting/Visualization engines

Data AnalysisOLAP, Data Mining engines

Data Warehouses / Data Marts

Data SourcesTransactional DB, ERP, CRM, Legacy systems

Multi-Tier architecture forManagement level andExecutive support systems

PRESENTATION

BUSINESSLOGIC

DATA

Page 8: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 8

Introduction and Terminology

OLAP (On-Line Analytical Processing):• Reporting based on (multidimensional) data analysis• Read-only access on repositories of moderate-large size

(typically, data warehouses), aiming at maximizingresponse time

Data Mining:• Discovery of novel, implicit patterns from, possibly

heterogeneous, data sources• Use a mix of sophisticated statistical and high-

performance computing techniques

Page 9: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 9

Outline

1. Introduction and Terminology

2. Data Warehousing3. Data Mining

• Association rules• Sequential patterns• Classification• Clustering

Page 10: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 10

Data Warehousing

DATA WAREHOUSEDatabase with the following distinctive characteristics:• Separate from operational databases• Subject oriented: provides a simple, concise view on one or

more selected areas, in support of the decision process• Constructed by integrating multiple, heterogeneous data

sources• Contains historical data: spans a much longer time horizon

than operational databases• (Mostly) Read-Only access: periodic, infrequent updates

Page 11: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 11

Data Warehousing

Types of Data Warehouses

• Enterprise Warehouse: covers all areas of interest for anorganization

• Data Mart: covers a subset of corporate-wide data that isof interest for a specific user group (e.g., marketing).

• Virtual Warehouse: offers a set of views constructed ondemand on operational databases. Some of the views couldbe materialized (precomputed)

Page 12: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 12

Data Warehousing

Multi-Tier Architecture

DBDBData Warehouse Server

AnalysisReporting

Data Mining

Data sources Data StorageOLAPengine

Front-EndTools

Cleaningextraction

Page 13: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 13

Data Warehousing

Multidimensional (logical) ModelData are organized around one or more FACT TABLEs. EachFact Table collects a set of omogeneous events (facts)characterized by dimensions and dependent attributesExample: Sales at a chain of stores

10030

Units

9000€2qtrSt3S1P21500€1qtrSt1S1P1SalesPeriodStoreSupplierProduct

dimensions dependent attributes

Page 14: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 14

Data Warehousing

Multidimensional (logical) Model (cont’d)Each dimension can in turn consist of a number of attributes.In this case the value in the fact table is a foreign key referringto an appropriate dimension table

AddressNameCode

supplier

DescriptionCode

product

AddressManager

NameCode

Store

Units

StorePeriod

Sales

SupplierProduct

Fact table

dimensiontable

dimensiontable

dimensiontable

STARSCHEMA

Page 15: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 15

Data Warehousing

Conceptual Star Schema (E-R Model)

FACTsPRODUCT (1,1) (0,N)(0,N)

(0,N)

(1,1)

(1,1)STORE

SUPPLIER

PERIOD

(1,1)

(0,N)

Page 16: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 16

Data Warehousing

OLAP Server ArchitecturesThey are classified based on the underlying storage layoutsROLAP (Relational OLAP): uses relational DBMS to storeand manage warehouse data (i.e., table-orientedorganization), and specific middleware to support OLAPqueries.MOLAP (Multidimensional OLAP): uses array-based datastructures and pre-computed aggregated data. It shows higherperformance than OLAP but may not scale well if not properlyimplementedHOLAP (Hybird OLAP): ROLAP approach for low-level rawdata, MOLAP approach for higher-level data (aggregations).

Page 17: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 17

Data Warehousing

MOLAP ApproachPeriod

Produc

t

Regio

nCD

TV

DVDPC

1Qtr 2Qtr 3Qtr 4QtrEurope

North America

Middle East

Far East

DATA CUBErepresenting a

Fact Table

Total sales of PCs ineurope in the 4th

quarter of the year

Page 18: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 18

Data Warehousing

OLAP Operations: SLICE

Fix values for one ormore dimensionsE.g. Product = CD

Page 19: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 19

Data Warehousing

OLAP Operations: DICE

Fix ranges for one ormore dimensionsE.g.Product = CD or DVDPeriod = 1Qrt or 2QrtRegion = Europe orNorth America

Page 20: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 20

Data Warehousing

OLAP Operations: Roll-UpPeriod

Produc

t

Region

CD

TV

DVDPC

yearEurope

North America

Middle East

Far East

Aggregate data by groupingalong one (or more)dimensionsE.g.: group quarters

Drill-Down = (Roll-Up)-1

Page 21: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 21

Data Warehousing

Cube Operator: summaries for each subset of dimensions

North America

Middle East

CD

TV

DVDPC

1Qtr 2Qtr 3Qtr 4QtrEurope

Far East

SUM

SUM

SUM

Yearly sales ofelectronics in themiddle east

Yearly sales of PCsin the middle east

Page 22: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 22

Data Warehousing

Cube Operator: it is equivalent to computing the followinglattice of cuboids

FACT TABLE

product, period product, region period, region

product period region

product, period, region

Page 23: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 23

Data Warehousing

Cube Operator In SQL:

SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH CUBE

Page 24: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 24

Data Warehousing

*…ALLALL

*ALLALLALL

*EuropeALLALL

*………

*ALL1qtrALL

*………

*ALLALLCD

*…ALL…

*EuropeALLCD

*……ALL

*Europe1qtrALL

*ALL……

*ALL1qtrCD

*………

*Europe1qtrCD

Tot. SalesRegionPeriodProduct

All combinations of Product,Period and RegionAll combinations of Productand Period

All combinations of Product

Page 25: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 25

Data Warehousing

Roll-up (= partial cube) Operator In SQL:

SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH ROLL-UP

Page 26: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 26

Data Warehousing

*ALLALLALL

*ALLALL…

*ALLALLCD

*ALL……

*ALL1qtrCD

*………

*Europe1qtrCD

Tot. SalesRegionPeriodProduct

All combinations of Product,Period and RegionAll combinations of Productand PeriodAll combinations of Product

Reduces the complexity from exponential to linear in thenumber of dimensions

Page 27: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 27

Data Warehousing

it is equivalent to computing the following subset of the latticeof cuboids

product, period product, region period, region

product period region

product, period, region

Page 28: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 28

Outline

1. Introduction and Terminology

2. Data Warehousing3. Data Mining

• Association rules• Sequential patterns• Classification• Clustering

Page 29: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 29

Data Mining

Data Explosion: tremendous amount of data accumulated indigital repositories around the world (e.g., databases, datawarehouses, web, etc.)

Production of digital data /Year:• 3-5 Exabytes (1018 bytes) in 2002• 30% increase per year (99-02)

We are drowning in data, but starving for knowledge

See:

www.sims.berkeley.edu/how-much-info

Page 30: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 30

Data Mining

Knowledge Discovery in Databases

Data Warehouse

DBDBInformation repositories

Trainingdata

KNOWLEDGE

Data cleaning& integration

Selection &transformation

Data MiningEvaluation &presentation

Domainknowledge

Page 31: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 31

Data Mining

Typologies of input data:

• Unaggregated data (e.g., records, transactions)• Aggregated data (e.g., summaries)• Spatial, geographic data• Data from time-series databases• Text, video, audio, web data

Page 32: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 32

Data Mining

DATA MININGProcess of discovering interesting patterns orknowledge from a (typically) large amount of datastored either in databases, data warehouses, orother information repositories

Alternative names: knowledge discovery/extraction,information harvesting, business intelligenceIn fact, data mining is a step of the more general processof knowledge discovery in databases (KDD)

Interesting: non-trivial, implicit, previously unknown,potentially useful

Page 33: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 33

Data Mining

Interestingness measuresPurpose: filter irrelevant patterns to convey concise anduseful knowledge. Certain data mining tasks can producethousands or millions of patterns most of which areredundant, trivial, irrelevant.Objective measures: based on statistics and structure ofpatterns (e.g., frequency counts)Subjective measures: based on user’s belief about thedata. Patterns may become interesting if they confirm orcontradict a user’s hypothesis, depending on the context.Interestingness measures can employed both after and duringthe pattern discovery. In the latter case, they improve thesearch efficiency

Page 34: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 34

Data Mining

Data Mining

Algorithms

Multidisciplinarity of Data Mining

StatisticsHigh-

PerformanceComputing

Machinelearning DB

TechnologyVisualization

Page 35: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 35

Data Mining

Data Mining Problems

Association Rules: discovery of rules X ⇒Y (“objects thatsatisfy condition X are also likely to satisfy condition Y”). Theproblem first found application in market basket or transactiondata analysis, where “objects” are transactions and“conditions” are containment of certain itemsets

Page 36: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 36

Association Rules

Statement of the ProblemI =Set of itemsD =Set of transactions: t ∈ D then t ⊆ IFor an itemset X ⊆ I,support(X) = fraction or number of transactions containing XASSOCIATION RULE: X-Y ⇒⇒⇒⇒ Y, with Y ⊂ X ⊆ I

support = support(X)confidence = support(X)/support(X-Y)

PROBLEM: find all association rules with support ≥ than minsup and confidence ≥ than min confidence

Page 37: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 37

Association Rules

Market Basket Analysis:Items = products sold in a store or chain of storesTransactions = customers’ shopping basketsRule X-Y⇒⇒⇒⇒Y = customers who buy items in X-Y are likely to

buy items in Y

Of diapers and beer …..Analysis of customers behaviour in a supermarket chain hasrevealed that males who on thursdays and saturdays buydiapers are likely to buy also beer…. That’s why these two items are found close to eachother in most stores

Page 38: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 38

Association Rules

Applications:

• Cross/Up-selling (especially in e-comm., e.g., Amazon)� Cross-selling: push complemetary products� Up-selling: push similar products

• Catalog design• Store layout (e.g., diapers and beer close to each other)• Financial forecast• Medical diagnosis

Page 39: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 39

Association Rules

Def.: Frequent itemset = itemset with support ≥ min supGeneral Strategy to discover all association rules:1. Find all frequent itemsets2. ∀ frequent itemset X, output all rules X-Y⇒⇒⇒⇒Y, with Y⊂X,

which satisfy the min confindence constraintObservation:Min sup and min confidence are objective measures ofinterestingness. Their proper setting, however, requiresuser’s domain knowledge. Low values may yield exponentially(in |I|) many rules, high values may cut off interesting rules

Page 40: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 40

Association Rules

Example

For rule {A} ⇒ {C} :support = support({A,C}) = 50%confidence = support({A,C})/support({A}) = 66.6%

For rule {C} ⇒ {A} (same support as {A} ⇒ {C}):confidence = support({A,C})/support({C}) = 100%

Min. sup 50%Min. confidence 50%

B, E, F40A, D30A, C20

A, B, C10Items boughtTransaction-id

50%{A, C}50%{C}50%{B}75%{A}

SupportFrequent Itemsets

Page 41: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 41

Association Rules

Dealing with Large OutputsObservation: depending on the values of min sup and onthe dataset, the number frequent itemsets can beexponentially large but may contain a lot of redundancy

Goal: determine a subset of frequent itemsets ofconsiderably smaller size, which provides the sameinformation content, i.e., from which the complete set offrequent itemsets can be derived without further information.

Page 42: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 42

Association Rules

Notion of ClosednessI (items), D (transactions), min sup (support threshold)Closed Frequent Itemsets =

{X⊆I : supp(X)≥min sup & supp(Y)<supp(X) ∀ I⊇Y⊃X}• It’s a subset of all frequent itemsets• For every frequent itemset X there exists a closed

frequent itemset Y⊇X such that supp(Y)=supp(X), i.e., Yand X occur in exactly the same transcations

• All frequent itemsets and their frequencies can be derivedfrom the closed frequent itemsets without furtherinformation

Page 43: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 43

Association Rules

Notion of MaximalityMaximal Frequent Itemsets =

{X⊆I : supp(X)≥min sup & supp(Y)<min sup ∀ I⊇Y⊃X}• It’s a subset of the closed frequent itemsets, hence of all

frequent itemsets• For every (closed) frequent itemset X there exists a

maximal frequent itemset Y⊇X• All frequent itemsets can be derived from the maximal

frequent itemsets without further information, howevertheir frequencies must be determined from D⇒⇒⇒⇒ information loss

Page 44: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 44

Association Rules

B, L, G, H40B, A, D, F50

B, A, D, F, L, G60

B, A, D, F, L, H30B, L20

B, A, D, F, G, H10ItemsTid

YESYESYESYESNO

Maximal

33446

Support

10, 40, 60B, G10, 30, 40B, H

20, 30, 40, 60B, L10, 30, 50, 60B, A, D, F

allB

SupportingTransactions

Closed FrequentItemsets

Min sup = 3DATASET

Example of closed andmaximal frequent itemsets

Page 45: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 45

Association Rules

(B,6) (A,4) (D,4) (F,4) (L,4) (G,3) (H,3)

(B,4) (B,4) (A,4) (B,4) (A,4) (D,4) (B,4) (B,3) (B,3)

(B,4) (B,4) (B,4) (A,4)

(B,4)

(•,6)

= closed & maximal

= closed but not maximal

(All frequent) VS (closed frequent) VS (maximal frequent)

Page 46: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 46

Sequential Patterns

Data Mining Problems

Sequential Patterns: discovery of frequent subsequences ina collection of sequences (sequence database), eachrepresenting a set of events occurring at subsequent times.The ordering of the events in the subsequences is relevant.

Page 47: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 47

Sequential Patterns

Sequential PatternsPROBLEM: Given a set of sequences, find the complete setof frequent subsequences

sequence database A sequence: <(ef)(ab)(df)(c)(b)>

An element may contain a set of items.Items within an element are unorderedand are listed alphabetically.

<(a)(bc)(d)(c)>is a subsequence of<(<(a)(abc)(ac)(d)(cf)>

For min sup = 2, <(ab)(c)> is a frequent subsequence

<(e)(g)(af)(c)(b)(c)>40<(ef)(ab)(df)(c)(b)>30<(ad)(c)(bc)(ae)>20

<(a)(abc)(ac)(d)(cf)>10sequenceSID

Page 48: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 48

Sequential Patterns

Applications:

• Marketing• Natural disaster forecast• Analysis of web log data• DNA analysis

Page 49: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 49

Classification

Data Mining Problems

Classification/Regression: discovery of a model or functionthat maps objects into predefined classes (classification) orinto suitable values (regression). The model/function iscomputed on a training set (supervised learning)

Page 50: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 50

Classification

Statement of the ProblemTraining Set: T = {t1, …, tn} set of n examplesEach example ti

• characterized by m features (ti(A1), …, ti(Am))• belongs to one of k classes (Ci : 1 ≤ i ≤ k)

GOALFrom the training data find a model to describe the classesaccurately and synthetically using the data’s features. Themodel will then be used to assign class labels to unknown(previously unseen) records

Page 51: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 51

Classification

Applications:

• Classification of (potential) customers for: Credit approval,risk prediction, selective marketing

• Performance prediction based on selected indicators• Medical diagnosis based on symptoms or reactions to

therapy

Page 52: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 52

Classification

Classification process

Training Set

Test Data

Unseen Data

VALIDATE

BUILD

PREDICTclass labels

model

Page 53: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 53

Classification

Observations:• Features can be either categorical if belonging to

unordered domains (e.g., Car Type), or continuous, ifbelonging to ordered domains (e.g., Age)

• The class could be regarded as an additional attribute ofthe examples, which we want to predict

• Classification vs Regression:Classification: builds models for categorical classesRegression: builds models for continuous classes

• Several types of models exist: decision trees, neuralnetworks, bayesan (statistical) classifiers.

Page 54: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 54

Classification

Classification using decision treesDefinition: Decision Tree for a training set T• Labeled tree• Each internal node v represents a test on a feature. The

edges from v to its children are labeled with mutuallyexclusive results of the test

• Each leaf w represents the subset of examples of Twhose features values are consistent with the testresults found along the path from the root to w. Theleaf is labeled with the majority class of the examples itcontains

Page 55: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 55

Classification

Example:examples = car insurance applicants, class = insurance risk

highfamily20lowtruck32lowfamily68highsports43highsports18highfamily23RiskCar TypeAge

features classAge < 25

Car type ∈{sports}

High

High Low

Model(decision tree)

Page 56: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 56

Clustering

Data Mining Problems

Clustering: grouping objects into classes with theobjective of maximizing intra-class similarity andminimizing inter-class similarity (unsupervised learning)

Page 57: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 57

Clustering

Statement of the ProblemGIVEN: N objects, each characterized by p attributes(a.k.a. variables)GROUP: the objects into K clusters featuring• High intra-cluster similarity• Low inter-cluster similarityRemark: Clustering is an instance of unsupervised learning orlearning by observations, as opposed to supervised learningor learning by examples (classification)

Page 58: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 58

Clustering

Exampleoutlier

attribute X

attribute Y

attribute X

attribute Y

object

Page 59: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 59

Clustering

Several types of clustering problems exist dependingon the specific input/output requirements, and on thenotion of similarity:• The number of clusters K may be provided in input or not• As output, for each cluster one may want a representative

object, or a set of aggregate measurements, or thecomplete set of objects belonging to the cluster

• Distance-based clustering: similarity of objects is related tosome kind of geometric distance

• Conceptual clustering: a group of objects forms a cluster ifthey define a certain concept

Page 60: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 60

Clustering

Applications:

• Marketing: identify groups of customers based on theirpurchasing patterns

• Biology: categorize genes with similar functionalities• Image processing: clustering of pixels• Web: clustering of documents in meta-search engines• GIS: identification of areas of similar land use

Page 61: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 61

Clustering

Challenges:

• Scalability: strategies to deal with very large datasets• Variety of attribute types: defining a good notion of

similarity is hard in the presence of different types ofattributes and/or different scales of values

• Variety of cluster shapes: common distance measuresprovide only spherical clusters

• Noisy data: outliers may affect the quality of the clustering• Sensitivity to input ordering: the ordering of the input data

should not affect the output (or its quality)

Page 62: Data Warehousing and Data Mining

A.A. 04-05 Datawarehousing & Datamining 62

Clustering

Main Distance-Based Clustering MethodsPartitioning Methods: create an initial partition of the objectsinto K clusters, and refine the clustering using iterativerelocation techniques. A cluster is represented either by themean value of its component objects or by a centrally locatedcomponent object (medoid)Hierarchical Methods: start with all objects belonging todistinct clusters and then successively merge the pair ofclosest clusters/objects into one single cluster (agglomerativeapproach); or start with all objects belonging to one clusterand successively split up a cluster into smaller ones (divisiveapproach)