data warehousing and data mining
DESCRIPTION
TRANSCRIPT
A.A. 04-05 Datawarehousing & Datamining 1
Data Warehousing
and
Data Mining
A.A. 04-05 Datawarehousing & Datamining 2
Outline
1. Introduction and Terminology
2. Data Warehousing3. Data Mining
• Association rules• Sequential patterns• Classification• Clustering
A.A. 04-05 Datawarehousing & Datamining 3
Introduction and Terminology
Evolution of database technologyFile processing (60s)
Relational DBMS (70s)
Advanced data models e.g.,Object-Oriented,
application-oriented (80s)Web-Based Repositories (90s)
Data Warehousing andData Mining (90s)
Global/Integrated Information Systems (2000s)
A.A. 04-05 Datawarehousing & Datamining 4
Introduction and Terminology
Major types of information systems within an organization
TRANSACTIONPROCESSING
SYSTEMSEnterprise Resource Planning (ERP)Customer Relationship Management (CRM)
KNOWLEDGELEVEL
SYSTEMSOffice automation, Groupware, ContentDistribution systems, Workflowmangement systems
MANAGEMENTLEVEL
SYSTEMSDecision Support Systems (DSS)Management Information Sys. (MIS)
EXECUTIVESUPPORTSYSTEMS
Fewsophisticated
users
Many un-sophisticated
users
A.A. 04-05 Datawarehousing & Datamining 5
Introduction and Terminology
Transaction processing systems:• Support the operational level of the organization, possibly
integrating needs of different functional areas (ERP);• Perform and record the daily transactions necessary to the
conduct of the business• Execute simple read/update operations on traditional
databases, aiming at maximizing transaction throughput• Their activity is described as:
OLTP (On-Line Transaction Processing)
A.A. 04-05 Datawarehousing & Datamining 6
Introduction and Terminology
Knowledge level systems: provide digital support for managingdocuments (office automation), user cooperation andcommunication (groupware), storing and retrievinginformation (content distribution), automation of businessprocedures (workflow management)
Management level systems: support planning, controlling andsemi-structured decision making at management level byproviding reports and analyses of current and historical data
Executive support systems: support unstructured decisionmaking at the strategic level of the organization
A.A. 04-05 Datawarehousing & Datamining 7
Introduction and Terminology
DecisionMaking
Data PresentationReporting/Visualization engines
Data AnalysisOLAP, Data Mining engines
Data Warehouses / Data Marts
Data SourcesTransactional DB, ERP, CRM, Legacy systems
Multi-Tier architecture forManagement level andExecutive support systems
PRESENTATION
BUSINESSLOGIC
DATA
A.A. 04-05 Datawarehousing & Datamining 8
Introduction and Terminology
OLAP (On-Line Analytical Processing):• Reporting based on (multidimensional) data analysis• Read-only access on repositories of moderate-large size
(typically, data warehouses), aiming at maximizingresponse time
Data Mining:• Discovery of novel, implicit patterns from, possibly
heterogeneous, data sources• Use a mix of sophisticated statistical and high-
performance computing techniques
A.A. 04-05 Datawarehousing & Datamining 9
Outline
1. Introduction and Terminology
2. Data Warehousing3. Data Mining
• Association rules• Sequential patterns• Classification• Clustering
A.A. 04-05 Datawarehousing & Datamining 10
Data Warehousing
DATA WAREHOUSEDatabase with the following distinctive characteristics:• Separate from operational databases• Subject oriented: provides a simple, concise view on one or
more selected areas, in support of the decision process• Constructed by integrating multiple, heterogeneous data
sources• Contains historical data: spans a much longer time horizon
than operational databases• (Mostly) Read-Only access: periodic, infrequent updates
A.A. 04-05 Datawarehousing & Datamining 11
Data Warehousing
Types of Data Warehouses
• Enterprise Warehouse: covers all areas of interest for anorganization
• Data Mart: covers a subset of corporate-wide data that isof interest for a specific user group (e.g., marketing).
• Virtual Warehouse: offers a set of views constructed ondemand on operational databases. Some of the views couldbe materialized (precomputed)
A.A. 04-05 Datawarehousing & Datamining 12
Data Warehousing
Multi-Tier Architecture
DBDBData Warehouse Server
AnalysisReporting
Data Mining
Data sources Data StorageOLAPengine
Front-EndTools
Cleaningextraction
A.A. 04-05 Datawarehousing & Datamining 13
Data Warehousing
Multidimensional (logical) ModelData are organized around one or more FACT TABLEs. EachFact Table collects a set of omogeneous events (facts)characterized by dimensions and dependent attributesExample: Sales at a chain of stores
10030
Units
9000€2qtrSt3S1P21500€1qtrSt1S1P1SalesPeriodStoreSupplierProduct
dimensions dependent attributes
A.A. 04-05 Datawarehousing & Datamining 14
Data Warehousing
Multidimensional (logical) Model (cont’d)Each dimension can in turn consist of a number of attributes.In this case the value in the fact table is a foreign key referringto an appropriate dimension table
AddressNameCode
supplier
DescriptionCode
product
AddressManager
NameCode
Store
Units
StorePeriod
Sales
SupplierProduct
Fact table
dimensiontable
dimensiontable
dimensiontable
STARSCHEMA
A.A. 04-05 Datawarehousing & Datamining 15
Data Warehousing
Conceptual Star Schema (E-R Model)
FACTsPRODUCT (1,1) (0,N)(0,N)
(0,N)
(1,1)
(1,1)STORE
SUPPLIER
PERIOD
(1,1)
(0,N)
A.A. 04-05 Datawarehousing & Datamining 16
Data Warehousing
OLAP Server ArchitecturesThey are classified based on the underlying storage layoutsROLAP (Relational OLAP): uses relational DBMS to storeand manage warehouse data (i.e., table-orientedorganization), and specific middleware to support OLAPqueries.MOLAP (Multidimensional OLAP): uses array-based datastructures and pre-computed aggregated data. It shows higherperformance than OLAP but may not scale well if not properlyimplementedHOLAP (Hybird OLAP): ROLAP approach for low-level rawdata, MOLAP approach for higher-level data (aggregations).
A.A. 04-05 Datawarehousing & Datamining 17
Data Warehousing
MOLAP ApproachPeriod
Produc
t
Regio
nCD
TV
DVDPC
1Qtr 2Qtr 3Qtr 4QtrEurope
North America
Middle East
Far East
DATA CUBErepresenting a
Fact Table
Total sales of PCs ineurope in the 4th
quarter of the year
A.A. 04-05 Datawarehousing & Datamining 18
Data Warehousing
OLAP Operations: SLICE
Fix values for one ormore dimensionsE.g. Product = CD
A.A. 04-05 Datawarehousing & Datamining 19
Data Warehousing
OLAP Operations: DICE
Fix ranges for one ormore dimensionsE.g.Product = CD or DVDPeriod = 1Qrt or 2QrtRegion = Europe orNorth America
A.A. 04-05 Datawarehousing & Datamining 20
Data Warehousing
OLAP Operations: Roll-UpPeriod
Produc
t
Region
CD
TV
DVDPC
yearEurope
North America
Middle East
Far East
Aggregate data by groupingalong one (or more)dimensionsE.g.: group quarters
Drill-Down = (Roll-Up)-1
A.A. 04-05 Datawarehousing & Datamining 21
Data Warehousing
Cube Operator: summaries for each subset of dimensions
North America
Middle East
CD
TV
DVDPC
1Qtr 2Qtr 3Qtr 4QtrEurope
Far East
SUM
SUM
SUM
Yearly sales ofelectronics in themiddle east
Yearly sales of PCsin the middle east
A.A. 04-05 Datawarehousing & Datamining 22
Data Warehousing
Cube Operator: it is equivalent to computing the followinglattice of cuboids
FACT TABLE
product, period product, region period, region
∅
product period region
product, period, region
A.A. 04-05 Datawarehousing & Datamining 23
Data Warehousing
Cube Operator In SQL:
SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH CUBE
A.A. 04-05 Datawarehousing & Datamining 24
Data Warehousing
*…ALLALL
*ALLALLALL
*EuropeALLALL
*………
*ALL1qtrALL
*………
*ALLALLCD
*…ALL…
*EuropeALLCD
*……ALL
*Europe1qtrALL
*ALL……
*ALL1qtrCD
*………
*Europe1qtrCD
Tot. SalesRegionPeriodProduct
All combinations of Product,Period and RegionAll combinations of Productand Period
All combinations of Product
A.A. 04-05 Datawarehousing & Datamining 25
Data Warehousing
Roll-up (= partial cube) Operator In SQL:
SELECT Product, Period, Region, SUM(Total Sales)FROM FACT-TABLEGROUP BY Product, Period, RegionWITH ROLL-UP
A.A. 04-05 Datawarehousing & Datamining 26
Data Warehousing
*ALLALLALL
*ALLALL…
*ALLALLCD
*ALL……
*ALL1qtrCD
*………
*Europe1qtrCD
Tot. SalesRegionPeriodProduct
All combinations of Product,Period and RegionAll combinations of Productand PeriodAll combinations of Product
Reduces the complexity from exponential to linear in thenumber of dimensions
A.A. 04-05 Datawarehousing & Datamining 27
Data Warehousing
it is equivalent to computing the following subset of the latticeof cuboids
product, period product, region period, region
∅
product period region
product, period, region
A.A. 04-05 Datawarehousing & Datamining 28
Outline
1. Introduction and Terminology
2. Data Warehousing3. Data Mining
• Association rules• Sequential patterns• Classification• Clustering
A.A. 04-05 Datawarehousing & Datamining 29
Data Mining
Data Explosion: tremendous amount of data accumulated indigital repositories around the world (e.g., databases, datawarehouses, web, etc.)
Production of digital data /Year:• 3-5 Exabytes (1018 bytes) in 2002• 30% increase per year (99-02)
We are drowning in data, but starving for knowledge
See:
www.sims.berkeley.edu/how-much-info
A.A. 04-05 Datawarehousing & Datamining 30
Data Mining
Knowledge Discovery in Databases
Data Warehouse
DBDBInformation repositories
Trainingdata
KNOWLEDGE
Data cleaning& integration
Selection &transformation
Data MiningEvaluation &presentation
Domainknowledge
A.A. 04-05 Datawarehousing & Datamining 31
Data Mining
Typologies of input data:
• Unaggregated data (e.g., records, transactions)• Aggregated data (e.g., summaries)• Spatial, geographic data• Data from time-series databases• Text, video, audio, web data
A.A. 04-05 Datawarehousing & Datamining 32
Data Mining
DATA MININGProcess of discovering interesting patterns orknowledge from a (typically) large amount of datastored either in databases, data warehouses, orother information repositories
Alternative names: knowledge discovery/extraction,information harvesting, business intelligenceIn fact, data mining is a step of the more general processof knowledge discovery in databases (KDD)
Interesting: non-trivial, implicit, previously unknown,potentially useful
A.A. 04-05 Datawarehousing & Datamining 33
Data Mining
Interestingness measuresPurpose: filter irrelevant patterns to convey concise anduseful knowledge. Certain data mining tasks can producethousands or millions of patterns most of which areredundant, trivial, irrelevant.Objective measures: based on statistics and structure ofpatterns (e.g., frequency counts)Subjective measures: based on user’s belief about thedata. Patterns may become interesting if they confirm orcontradict a user’s hypothesis, depending on the context.Interestingness measures can employed both after and duringthe pattern discovery. In the latter case, they improve thesearch efficiency
A.A. 04-05 Datawarehousing & Datamining 34
Data Mining
Data Mining
Algorithms
Multidisciplinarity of Data Mining
StatisticsHigh-
PerformanceComputing
Machinelearning DB
TechnologyVisualization
A.A. 04-05 Datawarehousing & Datamining 35
Data Mining
Data Mining Problems
Association Rules: discovery of rules X ⇒Y (“objects thatsatisfy condition X are also likely to satisfy condition Y”). Theproblem first found application in market basket or transactiondata analysis, where “objects” are transactions and“conditions” are containment of certain itemsets
A.A. 04-05 Datawarehousing & Datamining 36
Association Rules
Statement of the ProblemI =Set of itemsD =Set of transactions: t ∈ D then t ⊆ IFor an itemset X ⊆ I,support(X) = fraction or number of transactions containing XASSOCIATION RULE: X-Y ⇒⇒⇒⇒ Y, with Y ⊂ X ⊆ I
support = support(X)confidence = support(X)/support(X-Y)
PROBLEM: find all association rules with support ≥ than minsup and confidence ≥ than min confidence
A.A. 04-05 Datawarehousing & Datamining 37
Association Rules
Market Basket Analysis:Items = products sold in a store or chain of storesTransactions = customers’ shopping basketsRule X-Y⇒⇒⇒⇒Y = customers who buy items in X-Y are likely to
buy items in Y
Of diapers and beer …..Analysis of customers behaviour in a supermarket chain hasrevealed that males who on thursdays and saturdays buydiapers are likely to buy also beer…. That’s why these two items are found close to eachother in most stores
A.A. 04-05 Datawarehousing & Datamining 38
Association Rules
Applications:
• Cross/Up-selling (especially in e-comm., e.g., Amazon)� Cross-selling: push complemetary products� Up-selling: push similar products
• Catalog design• Store layout (e.g., diapers and beer close to each other)• Financial forecast• Medical diagnosis
A.A. 04-05 Datawarehousing & Datamining 39
Association Rules
Def.: Frequent itemset = itemset with support ≥ min supGeneral Strategy to discover all association rules:1. Find all frequent itemsets2. ∀ frequent itemset X, output all rules X-Y⇒⇒⇒⇒Y, with Y⊂X,
which satisfy the min confindence constraintObservation:Min sup and min confidence are objective measures ofinterestingness. Their proper setting, however, requiresuser’s domain knowledge. Low values may yield exponentially(in |I|) many rules, high values may cut off interesting rules
A.A. 04-05 Datawarehousing & Datamining 40
Association Rules
Example
For rule {A} ⇒ {C} :support = support({A,C}) = 50%confidence = support({A,C})/support({A}) = 66.6%
For rule {C} ⇒ {A} (same support as {A} ⇒ {C}):confidence = support({A,C})/support({C}) = 100%
Min. sup 50%Min. confidence 50%
B, E, F40A, D30A, C20
A, B, C10Items boughtTransaction-id
50%{A, C}50%{C}50%{B}75%{A}
SupportFrequent Itemsets
A.A. 04-05 Datawarehousing & Datamining 41
Association Rules
Dealing with Large OutputsObservation: depending on the values of min sup and onthe dataset, the number frequent itemsets can beexponentially large but may contain a lot of redundancy
Goal: determine a subset of frequent itemsets ofconsiderably smaller size, which provides the sameinformation content, i.e., from which the complete set offrequent itemsets can be derived without further information.
A.A. 04-05 Datawarehousing & Datamining 42
Association Rules
Notion of ClosednessI (items), D (transactions), min sup (support threshold)Closed Frequent Itemsets =
{X⊆I : supp(X)≥min sup & supp(Y)<supp(X) ∀ I⊇Y⊃X}• It’s a subset of all frequent itemsets• For every frequent itemset X there exists a closed
frequent itemset Y⊇X such that supp(Y)=supp(X), i.e., Yand X occur in exactly the same transcations
• All frequent itemsets and their frequencies can be derivedfrom the closed frequent itemsets without furtherinformation
A.A. 04-05 Datawarehousing & Datamining 43
Association Rules
Notion of MaximalityMaximal Frequent Itemsets =
{X⊆I : supp(X)≥min sup & supp(Y)<min sup ∀ I⊇Y⊃X}• It’s a subset of the closed frequent itemsets, hence of all
frequent itemsets• For every (closed) frequent itemset X there exists a
maximal frequent itemset Y⊇X• All frequent itemsets can be derived from the maximal
frequent itemsets without further information, howevertheir frequencies must be determined from D⇒⇒⇒⇒ information loss
A.A. 04-05 Datawarehousing & Datamining 44
Association Rules
B, L, G, H40B, A, D, F50
B, A, D, F, L, G60
B, A, D, F, L, H30B, L20
B, A, D, F, G, H10ItemsTid
YESYESYESYESNO
Maximal
33446
Support
10, 40, 60B, G10, 30, 40B, H
20, 30, 40, 60B, L10, 30, 50, 60B, A, D, F
allB
SupportingTransactions
Closed FrequentItemsets
Min sup = 3DATASET
Example of closed andmaximal frequent itemsets
A.A. 04-05 Datawarehousing & Datamining 45
Association Rules
(B,6) (A,4) (D,4) (F,4) (L,4) (G,3) (H,3)
(B,4) (B,4) (A,4) (B,4) (A,4) (D,4) (B,4) (B,3) (B,3)
(B,4) (B,4) (B,4) (A,4)
(B,4)
(•,6)
= closed & maximal
= closed but not maximal
(All frequent) VS (closed frequent) VS (maximal frequent)
A.A. 04-05 Datawarehousing & Datamining 46
Sequential Patterns
Data Mining Problems
Sequential Patterns: discovery of frequent subsequences ina collection of sequences (sequence database), eachrepresenting a set of events occurring at subsequent times.The ordering of the events in the subsequences is relevant.
A.A. 04-05 Datawarehousing & Datamining 47
Sequential Patterns
Sequential PatternsPROBLEM: Given a set of sequences, find the complete setof frequent subsequences
sequence database A sequence: <(ef)(ab)(df)(c)(b)>
An element may contain a set of items.Items within an element are unorderedand are listed alphabetically.
<(a)(bc)(d)(c)>is a subsequence of<(<(a)(abc)(ac)(d)(cf)>
For min sup = 2, <(ab)(c)> is a frequent subsequence
<(e)(g)(af)(c)(b)(c)>40<(ef)(ab)(df)(c)(b)>30<(ad)(c)(bc)(ae)>20
<(a)(abc)(ac)(d)(cf)>10sequenceSID
A.A. 04-05 Datawarehousing & Datamining 48
Sequential Patterns
Applications:
• Marketing• Natural disaster forecast• Analysis of web log data• DNA analysis
A.A. 04-05 Datawarehousing & Datamining 49
Classification
Data Mining Problems
Classification/Regression: discovery of a model or functionthat maps objects into predefined classes (classification) orinto suitable values (regression). The model/function iscomputed on a training set (supervised learning)
A.A. 04-05 Datawarehousing & Datamining 50
Classification
Statement of the ProblemTraining Set: T = {t1, …, tn} set of n examplesEach example ti
• characterized by m features (ti(A1), …, ti(Am))• belongs to one of k classes (Ci : 1 ≤ i ≤ k)
GOALFrom the training data find a model to describe the classesaccurately and synthetically using the data’s features. Themodel will then be used to assign class labels to unknown(previously unseen) records
A.A. 04-05 Datawarehousing & Datamining 51
Classification
Applications:
• Classification of (potential) customers for: Credit approval,risk prediction, selective marketing
• Performance prediction based on selected indicators• Medical diagnosis based on symptoms or reactions to
therapy
A.A. 04-05 Datawarehousing & Datamining 52
Classification
Classification process
Training Set
Test Data
Unseen Data
VALIDATE
BUILD
PREDICTclass labels
model
A.A. 04-05 Datawarehousing & Datamining 53
Classification
Observations:• Features can be either categorical if belonging to
unordered domains (e.g., Car Type), or continuous, ifbelonging to ordered domains (e.g., Age)
• The class could be regarded as an additional attribute ofthe examples, which we want to predict
• Classification vs Regression:Classification: builds models for categorical classesRegression: builds models for continuous classes
• Several types of models exist: decision trees, neuralnetworks, bayesan (statistical) classifiers.
A.A. 04-05 Datawarehousing & Datamining 54
Classification
Classification using decision treesDefinition: Decision Tree for a training set T• Labeled tree• Each internal node v represents a test on a feature. The
edges from v to its children are labeled with mutuallyexclusive results of the test
• Each leaf w represents the subset of examples of Twhose features values are consistent with the testresults found along the path from the root to w. Theleaf is labeled with the majority class of the examples itcontains
A.A. 04-05 Datawarehousing & Datamining 55
Classification
Example:examples = car insurance applicants, class = insurance risk
highfamily20lowtruck32lowfamily68highsports43highsports18highfamily23RiskCar TypeAge
features classAge < 25
Car type ∈{sports}
High
High Low
Model(decision tree)
A.A. 04-05 Datawarehousing & Datamining 56
Clustering
Data Mining Problems
Clustering: grouping objects into classes with theobjective of maximizing intra-class similarity andminimizing inter-class similarity (unsupervised learning)
A.A. 04-05 Datawarehousing & Datamining 57
Clustering
Statement of the ProblemGIVEN: N objects, each characterized by p attributes(a.k.a. variables)GROUP: the objects into K clusters featuring• High intra-cluster similarity• Low inter-cluster similarityRemark: Clustering is an instance of unsupervised learning orlearning by observations, as opposed to supervised learningor learning by examples (classification)
A.A. 04-05 Datawarehousing & Datamining 58
Clustering
Exampleoutlier
attribute X
attribute Y
attribute X
attribute Y
object
A.A. 04-05 Datawarehousing & Datamining 59
Clustering
Several types of clustering problems exist dependingon the specific input/output requirements, and on thenotion of similarity:• The number of clusters K may be provided in input or not• As output, for each cluster one may want a representative
object, or a set of aggregate measurements, or thecomplete set of objects belonging to the cluster
• Distance-based clustering: similarity of objects is related tosome kind of geometric distance
• Conceptual clustering: a group of objects forms a cluster ifthey define a certain concept
A.A. 04-05 Datawarehousing & Datamining 60
Clustering
Applications:
• Marketing: identify groups of customers based on theirpurchasing patterns
• Biology: categorize genes with similar functionalities• Image processing: clustering of pixels• Web: clustering of documents in meta-search engines• GIS: identification of areas of similar land use
A.A. 04-05 Datawarehousing & Datamining 61
Clustering
Challenges:
• Scalability: strategies to deal with very large datasets• Variety of attribute types: defining a good notion of
similarity is hard in the presence of different types ofattributes and/or different scales of values
• Variety of cluster shapes: common distance measuresprovide only spherical clusters
• Noisy data: outliers may affect the quality of the clustering• Sensitivity to input ordering: the ordering of the input data
should not affect the output (or its quality)
A.A. 04-05 Datawarehousing & Datamining 62
Clustering
Main Distance-Based Clustering MethodsPartitioning Methods: create an initial partition of the objectsinto K clusters, and refine the clustering using iterativerelocation techniques. A cluster is represented either by themean value of its component objects or by a centrally locatedcomponent object (medoid)Hierarchical Methods: start with all objects belonging todistinct clusters and then successively merge the pair ofclosest clusters/objects into one single cluster (agglomerativeapproach); or start with all objects belonging to one clusterand successively split up a cluster into smaller ones (divisiveapproach)