very good minng
TRANSCRIPT
-
7/30/2019 Very Good Minng
1/301
Data Mining Tools
Overview & Tutorial
Ahmed Sameh
Prince Sultan University
Department of Computer Science & Info Sys
May 2010 (Some slides belong to IBM)
-
Introduction Outline
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Goal: Provide an overview of data mining.
-
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
-
Data Mining Definition
Finding hidden information in a database
Fit data to a model
Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning
-
Data Mining Algorithm
Objective: Fit Data to a Model
Descriptive
Predictive
Preference: Technique to choose the best model
Search: Technique to search the data
"Query"
-
Database Processing vs. Data Mining Processing
Database processing
Query: Well defined; SQL
Data: Operational data
Output: Precise; subset of database
Data mining processing
Query: Poorly defined; no precise query language
Data: Not operational data
Output: Fuzzy; not a subset of database
-
Query Examples
Database
Find all credit applicants with last name of Smith.
Identify customers who have purchased more than $10,000 in the last month.
Find all customers who have purchased milk.
Data Mining
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (clustering)
Find all items which are frequently purchased with milk. (association rules)
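The contrast between the two kinds of question can be sketched in a few lines of Python. This is only an illustration: the basket data, the customer names, and the `min_support` threshold are assumptions, not part of the slides.

```python
from collections import Counter

# Hypothetical market baskets; customer and item names are illustrative only.
baskets = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"beer", "chips"},
    "dave": {"milk", "cereal", "bread"},
}

# Database-style query: the condition is precise and the answer is a subset of the data.
milk_buyers = sorted(name for name, items in baskets.items() if "milk" in items)

# Mining-style question: which items co-occur with milk often enough to be interesting?
co_counts = Counter(
    item
    for items in baskets.values()
    if "milk" in items
    for item in items - {"milk"}
)
min_support = 2  # assumed threshold: at least two baskets
frequent_with_milk = sorted(item for item, n in co_counts.items() if n >= min_support)

print(milk_buyers)
print(frequent_with_milk)
```

The first result is an exact list of matching rows; the second depends on a threshold the analyst chooses, which is one sense in which mining output is "fuzzy".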
-
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
-
Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics, areas not part of data mining
Data Mining and Knowledge Discovery:
integrates theory and heuristics
focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
-
Definition
A class of database applications that analyze
data in a database using tools which look
for trends or anomalies.
IBM was an early developer of data mining tools; the field itself grew out of statistics, machine learning, and database research.
-
Purpose
To look for hidden patterns or previously
unknown relationships in a body of data
that can be used to predict future
behavior.
Ex: Data mining software can help retail
companies find customers with common
interests.
-
Background Information
Many of the techniques used by today's data
mining tools have been around for many years,
having originated in the artificial intelligence
research of the 1980s and early 1990s.
Data Mining tools are only now being applied
to large-scale database systems.
-
The Need for Data Mining
The amount of raw data stored in corporate
data warehouses is growing rapidly.
There is too much data, and too much complexity, that might be relevant to a specific problem.
Data mining promises to bridge the analytical
gap by giving knowledge workers the tools to
navigate this complex analytical space.
-
The Need for Data Mining (Cont.)
The need for information has resulted in the
proliferation of data warehouses that integrate
information from multiple sources to support
decision making.
These warehouses often include data from external sources, such
as customer demographics and household
information.
-
Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
-
Of Laws, Monsters, and Giants
Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
Its more aggressive cousin: disk storage capacity doubles every 9 months
[Figure: Disk TB Shipped per Year, 1988 to 2000, log scale from 1E+3 to 1E+7, with an ExaByte marker. Disk TB growth: 112%/y. Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), ht t :/ /www.d iskt rend .com/ d f/ o rt r k . d f.]
What do the two laws combined produce?
A rapidly growing gap between our ability to generate data, and our ability
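The doubling periods and the yearly percentages on this slide are related by simple compounding. The following is a small sketch of that arithmetic; the function names are illustrative. It suggests that 18 months per doubling matches the 58.7%/y label, while the 112%/y label on the chart corresponds to a doubling time of roughly 11 months rather than 9.

```python
import math

def annual_growth(months_per_doubling: float) -> float:
    """Fractional growth per year for a quantity that doubles every given number of months."""
    return 2 ** (12 / months_per_doubling) - 1

def months_to_double(annual_rate: float) -> float:
    """Months needed to double at a fractional growth of `annual_rate` per year."""
    return 12 * math.log(2) / math.log(1 + annual_rate)

print(f"{annual_growth(18):.1%}")        # doubling every 18 months
print(f"{annual_growth(9):.1%}")         # doubling every 9 months
print(f"{months_to_double(1.12):.1f}")   # doubling time implied by 112% per year
```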
-
What is Data Mining?
Finding interesting structure in data
Structure: refers to statistical patterns, predictive models, hidden relationships
Examples of tasks addressed by Data Mining
Predictive Modeling (classification, regression)
Segmentation (Data Clustering)
Summarization
-
Major Application Areas for Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
-
Data Mining
The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
Extremely large datasets
Discovery of the non-obvious
Useful knowledge that can improve processes
Cannot be done manually
Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.
-
Data Mining (cont.)
-
Data Mining (cont.)
Data Mining is a step of the Knowledge Discovery in Databases (KDD) process:
Data Warehousing
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
Data Mining is sometimes referred to as KDD, and DM and KDD tend to be used as synonyms
-
Data Mining Evaluation
-
Data Mining is Not
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
-
Data Mining Motivation
Changes in the Business Environment
Customers becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge
-
Why Use Data Mining Today?
Human analysis skills are inadequate:
Volume and dimensionality of the data
High data growth rate
Availability of:
Data
Storage
Computational power
Off-the-shelf software
Expertise
-
An Abundance of Data
Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails
-
Evolution of Database Technology
1960s: IMS, network model
1970s: The relational data model, first relational DBMS implementations
1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
2000s: High availability, zero-administration, seamless integration into business processes
2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???
-
Much Commercial Support
Many data mining tools
http://www.kdnuggets.com/software
Database systems with data mining support
Visualization tools
Data mining process support
Consultants
-
Why Use Data Mining Today?
Competitive pressure!
"The secret of success is to know something that nobody else knows."
-- Aristotle Onassis
Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
Personalization, CRM
The real-time enterprise
Systemic listening
Security, homeland defense
-
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
-
Data Mining Step in Detail
2.1 Data preprocessing
Data selection: Identify target datasets and relevant fields
Data cleaning: Remove noise and outliers
Data transformation: Create common units; generate new fields
2.2 Data mining model construction
2.3 Model evaluation
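The preprocessing sub-steps of step 2.1 can be sketched as follows. This is a minimal illustration only: the records, the field names, and the outlier rule (more than three times the median) are assumptions chosen to keep the example small.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"id": 1, "height": 180.0, "unit": "cm", "weight_kg": 81.0, "note": "ok"},
    {"id": 2, "height": 1.65, "unit": "m", "weight_kg": 60.0, "note": "ok"},
    {"id": 3, "height": 172.0, "unit": "cm", "weight_kg": None, "note": "scale broken"},
    {"id": 4, "height": 1.75, "unit": "m", "weight_kg": 700.0, "note": "typo?"},
]

# Data selection: keep only the fields relevant to the task.
selected = [{k: r[k] for k in ("id", "height", "unit", "weight_kg")} for r in raw]

# Data cleaning: drop missing values and crude outliers (assumed rule: > 3x the median).
known = [r["weight_kg"] for r in selected if r["weight_kg"] is not None]
limit = 3 * median(known)
cleaned = [r for r in selected if r["weight_kg"] is not None and r["weight_kg"] <= limit]

# Data transformation: create common units (metres) and generate a new field (BMI).
prepared = []
for r in cleaned:
    height_m = r["height"] / 100 if r["unit"] == "cm" else r["height"]
    prepared.append({
        "id": r["id"],
        "height_m": height_m,
        "weight_kg": r["weight_kg"],
        "bmi": round(r["weight_kg"] / height_m ** 2, 1),
    })

print(prepared)
```

Model construction (2.2) and evaluation (2.3) would then operate on `prepared` rather than on the raw records.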
-
Preprocessing and Mining
Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge
-
Data Mining Techniques
Descriptive: Clustering, Association, Sequential Analysis
Predictive: Classification, Regression
Classification methods: Decision Tree, Rule Induction, Neural Networks, Nearest Neighbor Classification
-
Data Mining Models and Tasks
-
Basic Data Mining Tasks
Classification maps data into predefined groups or classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item to a real valued prediction variable.
Clustering groups similar data together into clusters.
Unsupervised learning
Segmentation
Partitioning
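The difference between classification (classes are given in advance) and clustering (groups are found in the data) can be sketched as below. The data, the nearest-neighbour rule, and the basic k-means loop are illustrative choices, not methods prescribed by the slides.

```python
# Hypothetical labelled data: (annual income in $1000s, credit risk class).
training = [(20, "poor"), (25, "poor"), (60, "good"), (75, "good")]

def classify(income):
    """Classification: assign a predefined class using the nearest labelled example."""
    nearest = min(training, key=lambda pair: abs(pair[0] - income))
    return nearest[1]

def cluster(values, centers, rounds=10):
    """Clustering: group unlabelled values around k centers (a basic k-means loop)."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

print(classify(30))
print(cluster([1, 2, 3, 10, 11, 12], centers=[0, 5]))
```

`classify` needs labelled examples (supervised learning), while `cluster` is given only the values and a starting guess for the centers (unsupervised learning).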
-
Basic Data Mining Tasks (cont'd)
Summarization maps data into subsets with associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
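For association rules, two commonly used measures are support (how often an itemset appears) and confidence (how often the consequent appears when the antecedent does). The sketch below uses hypothetical transactions; the item names are illustrative only.

```python
# Hypothetical transactions; item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 3 of 5 transactions
print(confidence({"bread"}, {"milk"}))   # 3 of the 4 bread transactions
```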
-
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
-
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
-
Data Mining Development
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
-
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
-
KDD Issues (cont'd)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
-
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
-
Data Mining Applications
-
Data Mining Applications: Retail
Performing basket analysis
Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
Sales forecasting
Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
Database marketing
Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
Merchandise planning and allocation
When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic
-
Data Mining Applications: Banking
Card marketing
By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
Cardholder pricing and profitability
Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
Fraud detection
Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
Predictive life-cycle management
DM helps banks predict each customer's lifetime value and to service each segment appropriately (for example, offering special deals and discounts).
-
Data Mining Applications: Telecommunication
Call detail record analysis
Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
Customer loyalty
Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
-
Data Mining Applications: Other Applications
Customer segmentation
All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
Manufacturing
Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
Warranties
Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
Frequent flier incentives
Airlines can identify groups of customers that can be given incentives to fly more.
-
A producer wants to know...
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
-
Data, Data everywhere yet ...
I can't find the data I need
data is scattered over the network
many versions, subtle differences
I can't get the data I need
need an expert to get the data
I can't understand the data I found
available data poorly documented
I can't use the data I found
results are unexpected
data needs to be transformed from one form to other
-
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
-
What are the users saying...
Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required
-
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data -> Information
-
Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos
-
Data Warehousing -- It is a process
Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
A decision support database maintained separately from the organization's operational database
-
Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile
collection of data that is used primarily in
organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
-
Data Warehousing Concepts
Decision support is key for companies wanting to turn their organizational data into an information asset
Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support
Data Warehouse
"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications
-
What does data warehouse do?
integrate diverse information from various systems, enabling users to quickly produce powerful ad-hoc queries and perform complex analysis
create an infrastructure for reusing the data in numerous ways
create an open systems environment to make useful information easily accessible to authorized users
help managers make informed decisions
-
Benefits of Data Warehousing
Potential high returns on investment
Competitive advantage
Increased productivity of corporate decision-makers
-
Comparison of OLTP and Data Warehousing
OLTP systems | Data warehousing systems
Holds current data | Holds historic data
Stores detailed data | Stores detailed, lightly summarized, and highly summarized data
Data is dynamic | Data is largely static
Repetitive processing | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput | Medium to low transaction throughput
Predictable pattern of usage | Unpredictable pattern of usage
Transaction driven | Analysis driven
Application oriented | Subject oriented
Supports day-to-day decisions | Supports strategic decisions
Serves large number of clerical / operational users | Serves relatively lower number of managerial users
-
Data Warehouse Architecture
Operational Data
Load Manager
Warehouse Manager
Query Manager
Detailed Data
Lightly and Highly Summarized Data
Archive / Backup Data
Meta-Data
End-user Access Tools
-
End-user Access Tools
Reporting and query tools
Application development tools
Executive Information System (EIS) tools
Online Analytical Processing (OLAP) tools
Data mining tools
-
Data Warehousing Tools and Technologies
Extraction, Cleansing, and Transformation Tools
Data Warehouse DBMS
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Networked data warehouse
Warehouse administration
Integrated dimensional tools
Advanced query functionality
-
Data Marts
A subset of data warehouse that supports the requirements of a particular department or business function
-
Online Analytical Processing (OLAP)
OLAP
The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
Multi-dimensional OLAP
Cubes of data
[Figure: data cube with dimensions Time, City, and Product type]
-
Problems of Data Warehousing
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
-
Codd's Rules for OLAP
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels
-
OLAP Tools
Multi-dimensional OLAP (MOLAP)
Multi-dimensional DBMS (MDDBMS)
Relational OLAP (ROLAP)
Creation of multiple multi-dimensional views of the two-dimensional relations
Managed Query Environment (MQE)
Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
-
Data Mining
Definition
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
Knowledge discovery
Association rules
Sequential patterns
Classification trees
Goals
Prediction
Identification
Classification
Optimization
-
Data Mining Techniques
Predictive Modeling
Supervised training with two phases
Training phase: building a model using a large sample of historical data called the training set
Testing phase: trying the model on new data
Database Segmentation
Link Analysis
Deviation Detection
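The two phases of predictive modeling can be sketched as follows. The data and the very simple threshold model are assumptions made for illustration; real tools fit far richer models, but the split between data used for training and data held back for testing is the same idea.

```python
# Hypothetical historical data: (hours of product use per week, renewed subscription?).
history = [
    (1, False), (2, False), (3, False), (4, False),
    (6, True), (7, True), (8, True), (9, True),
    (2.5, False), (7.5, True),
]

# Hold back the last two records; the model never sees them during training.
training_set, test_set = history[:8], history[8:]

def train(examples):
    """Training phase: pick the threshold midway between the two class means."""
    pos = [x for x, label in examples if label]
    neg = [x for x, label in examples if not label]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    """Predict renewal when usage exceeds the learned threshold."""
    return x > threshold

threshold = train(training_set)

# Testing phase: measure accuracy on data the model has not seen.
accuracy = sum(predict(threshold, x) == label for x, label in test_set) / len(test_set)
print(threshold, accuracy)
```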
-
What are Data Mining Tasks?
Classification
Regression
Clustering
Summarization
Dependency modeling
Change and Deviation Detection
-
What are Data Mining Discoveries?
New Purchase Trends
Plan Investment Strategies
Detect Unauthorized Expenditure
Fraudulent Activities
Crime Trends
Smugglers-border crossing
-
Data Warehouse Architecture
[Figure: source data (Relational Databases, Legacy Data, Purchased Data, ERP Systems) passes through Extraction and Cleansing and the Optimized Loader into the Data Warehouse Engine, with a Metadata Repository; Analyze and Query tools sit on top of the engine.]
-
Data Warehouse for Decision Support & OLAP
Putting information technology to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over last 10 years?
-
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
-
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
-
We want to know ...
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
-
Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis
-
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
-
What makes data mining possible?
Advances in the following areas are making data mining deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools and
the advent of new data mining techniques.
-- Gartner Group
-
Why Separate Data Warehouse?
Performance
Operational databases are designed and tuned for known transactions and workloads.
Complex OLAP queries would degrade performance for operational transactions.
Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
Missing data: Decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
-
What are Operational Systems?
They are OLTP systems
Run mission critical applications
Need to work with stringent performance requirements for routine tasks
Used to run a business!
-
RDBMS used for OLTP
Database Systems have been used traditionally for OLTP
clerical data processing tasks
detailed, up to date data
structured repetitive tasks
read/update a few records
isolation, recovery and integrity are critical
-
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers
-
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
-
Application-Orientation vs. Subject-Orientation
Application-Orientation
Operational Database: Loans, Credit Card, Trust, Savings
Subject-Orientation
Data Warehouse: Customer, Vendor, Product, Activity
-
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
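The example query restricts along three dimensions (location, month, time of day) and then aggregates. A minimal sketch using Python's built-in `sqlite3` module is shown below; the `calls` table, its columns, and the sample rows are hypothetical, and "9AM-5PM" is interpreted here as calls starting from 09:00 up to 16:59.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, started TEXT, amount REAL)")

# Hypothetical call records: city, ISO timestamp, amount charged.
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2009-12-01 10:15:00", 12.0),
        ("Pune", "2009-12-14 16:40:00", 8.0),
        ("Pune", "2009-12-14 21:05:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2009-11-30 11:00:00", 50.0),    # outside December
        ("Mumbai", "2009-12-02 12:00:00", 99.0),  # different city
    ],
)

# Restrict along three dimensions (location, month, hour of day), then aggregate.
(avg_amount,) = conn.execute(
    """
    SELECT AVG(amount) FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', started) = '12'
      AND CAST(strftime('%H', started) AS INTEGER) BETWEEN 9 AND 16
    """
).fetchone()
print(avg_amount)
```

On an operational schema tuned for single-record lookups, a query like this scans and aggregates many rows, which is part of why the slides argue for a separately organized warehouse.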
-
OLTP vs Data Warehouse
OLTP | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current up to date | Snapshot data
Isolated Data | Integrated Data
Repetitive access | Ad-hoc access
Clerical User | Knowledge User (Manager)
-
OLTP vs Data Warehouse
OLTP | Data Warehouse
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes
-
OLTP vs Data Warehouse
OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets
-
To summarize ...
OLTP Systems are used to "run" a business
The Data Warehouse helps to "optimize" the business
-
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
-
Myths surrounding OLAP Servers and Data Marts
Data marts and OLAP servers are departmental solutions supporting a handful of users
Million dollar massively parallel hardware is needed to deliver fast time for complex queries
OLAP servers require massive and unwieldy indices
Complex OLAP queries clog the network with data
Data warehouses must be at least 100 GB to be effective
Source -- Arbor Software Home Page
-
II. On-Line Analytical Processing (OLAP)
Making Decision Support Possible
-
Typical OLAP Queries
Write a multi-table join to compare sales for each
product line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.
Repeat the above process to find the sales of a
product line to new vs. existing customers.
Repeat the above process to find the customers
that have had negative sales growth.
-
What Is OLAP?
Online Analytical Processing - coined by E. F. Codd in a 1994 paper contracted by Arbor Software*
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
-
The OLAP Market
Rapid growth in the enterprise market
1995: $700 Million
1997: $2.1 Billion
Significant consolidation activity among major DBMS vendors
10/94: Sybase acquires ExpressWay
7/95: Oracle acquires Express
11/95: Informix acquires Metacube
1/97: Arbor partners up with IBM
10/96: Microsoft acquires Panorama
Result: OLAP shifted from small vertical niche to mainstream DBMS category
-
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools
-
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
-- Nigel Pendse, Richard Creath - The OLAP Report
-
Multi-dimensional Data
"Hey... I sold $100M worth of goods"
[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Month (1 to 7), and Region (N, S, W)]
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month or Week > Day
-
7/30/2019 Very Good Minng
101/301
101
A Visual Operation: Pivot (Rotate)
[Figure: sales grid with Product (Juice, Cola, Milk, Cream) on one axis and Date (3/1 to 3/4) on the other; sample cell values 10, 47, 30, 12]
Slicing and Dicing
[Figure: cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and Region (India, Far East, Europe), highlighting the Telecomm slice]
Roll-up and Drill Down
Sales Channel
Region
Country
State
Location Address
Sales Representative
Higher Level of Aggregation
Low-level Details
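As a rough illustration of roll-up and drill-down, the sketch below aggregates a handful of made-up sales facts at different levels of a Country > State > Sales Representative hierarchy. The data, the `roll_up` helper and the three-level hierarchy are assumptions chosen for the example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical detail-level sales facts: (country, state, sales_rep, amount).
SALES = [
    ("USA", "CA", "rep_1", 100),
    ("USA", "CA", "rep_2", 50),
    ("USA", "NY", "rep_3", 70),
    ("India", "KA", "rep_4", 30),
]

def roll_up(facts, level):
    """Aggregate amounts at a hierarchy level: 0=country, 1=state, 2=rep."""
    totals = defaultdict(int)
    for *path, amount in facts:
        # Keep only the leading part of the path, down to the requested level.
        totals[tuple(path[: level + 1])] += amount
    return dict(totals)

by_country = roll_up(SALES, 0)   # higher level of aggregation
by_state = roll_up(SALES, 1)     # drill down one level
```

Rolling up simply drops the lower levels of the path before summing; drilling down keeps more of them.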
Results of Data Mining Include:
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Data mining is not
Brute-force crunching of bulk data
Blind application of algorithms
Going to find relationships where none exist
Presenting data in different ways
A database intensive task
A difficult to understand technology requiring an advanced degree in computer science
Data Mining versus OLAP
OLAP - On-line Analytical Processing
Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
Data Mining Versus Statistical Analysis
Data Mining
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense then let's use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
Data Analysis
Tests for statistical correctness of models
Are statistical assumptions of models correct? E.g. is the R-square good?
Hypothesis testing
Is the relationship significant? Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Examples of What People are Doing with Data Mining:
Fraud/Non-Compliance Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting customers
Maximizing profitability (cross selling, identifying profitable customers)
Service Delivery and Customer Retention
Build profiles of customers likely to use which services
Web Mining
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
Scheduled its workforce to provide faster, more accurate answers to questions.
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug busts and...
Analyzed suspects' cell phone usage to focus investigations.
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and...
Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Suggestion: Predicting Washington
C-Span has launched a digital archive of 500,000 hours of audio debates.
Text mining or audio mining of these talks to reveal certain questions such as...
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics
Shots blocked
Assists
Fouls
Google: IBM Advanced Scout
Advanced Scout
Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that "When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."
Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%.
Data Mining: Types of Data
Relational data and transactional data
Spatial and temporal data, spatio-temporal observations
Time-series data
Text
Images, video
Mixtures of data
Sequence data
Features from processing other data sources
Data Mining Techniques
Supervised learning
Classification and regression
Unsupervised learning
Clustering
Dependency modeling
Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
Different Types of Classifiers
Linear discriminant analysis (LDA)
Quadratic discriminant analysis (QDA)
Density estimation methods
Nearest neighbor methods
Logistic regression
Neural networks
Fuzzy set theory
Decision Trees
Test Sample Estimate
Divide D into D1 and D2
Use D1 to construct the classifier d
Then use resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d
Unbiased and efficient, but removes D2 from training dataset D
V-fold Cross Validation
Procedure:
Construct classifier d from D
Partition D into V datasets D1, ..., DV
Construct classifier di using D \ Di
Calculate the estimated misclassification error R(di,Di) of di using test sample Di
Final misclassification estimate:
Weighted combination of individual misclassification errors:
$R(d,D) = \frac{1}{V} \sum_{i=1}^{V} R(d_i, D_i)$
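The procedure can be sketched in a few lines of Python. The learner here is a deliberately trivial majority-class classifier, and the function names and the tiny dataset are assumptions made for illustration; any learner with the same train/predict shape could be substituted.

```python
from collections import Counter

def train_majority(rows):
    """Toy learner (an assumption): always predicts the most common label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def error_rate(classifier, rows):
    """R(d, D): fraction of rows in D that d misclassifies."""
    return sum(classifier(x) != y for x, y in rows) / len(rows)

def v_fold_cv(rows, v, train=train_majority):
    """Average of R(d_i, D_i) over V folds, with d_i trained on D minus D_i."""
    folds = [rows[i::v] for i in range(v)]
    errors = []
    for i, held_out in enumerate(folds):
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        errors.append(error_rate(train(training), held_out))
    return sum(errors) / v

DATA = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "a")]
cv_error = v_fold_cv(DATA, v=3)
```

With equally sized folds, as here, the plain average matches the 1/V weighting in the formula.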
Cross-Validation: Example
[Figure: classifier d built from all of D, alongside classifiers d1, d2, d3]
Cross-Validation
Misclassification estimate obtained through cross-validation is usually nearly unbiased
Costly computation (we need to compute d, and d1, ..., dV); computation of di is nearly as expensive as computation of d
Preferred method to estimate quality of learning algorithms in the machine learning literature
Decision Tree Construction
Three algorithmic components:
Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)
Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
Goodness of a Split
Consider node t with impurity phi(t)
The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:
$\Delta\phi(s,t) = \phi(t) - p_L\,\phi(t_L) - p_R\,\phi(t_R)$
where pL and pR are the fractions of the records at t that go to tL and tR.
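A small worked example may help. The slides do not fix a particular impurity function, so the sketch below assumes Gini impurity for phi; the function names and label lists are illustrative only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity, one common choice for phi(t) (an assumption here)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(left, right, phi=gini):
    """phi(t) - pL*phi(tL) - pR*phi(tR) for a split of t into left/right."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return phi(parent) - p_left * phi(left) - p_right * phi(right)

# A split that separates the classes perfectly versus one that does nothing.
perfect = impurity_reduction(["yes", "yes"], ["no", "no"])
useless = impurity_reduction(["yes", "no"], ["yes", "no"])
```

A split selection method would evaluate this quantity for each candidate predicate and prefer the one with the largest reduction.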
Pruning Methods
Test dataset pruning
Direct stopping rule
Cost-complexity pruning
MDL pruning
Pruning by randomization testing
Stopping Policies
A stopping policy indicates when further growth of the tree at a node t is counterproductive.
All records are of the same class
The attribute values of all records are identical
All records have missing values
At most one class has a number of records larger than a user-specified number
All records go to the same child node if t is split (only possible with some split
Test Dataset Pruning
Use an independent test sample D' to estimate the misclassification cost using the resubstitution estimate R(T,D') at each node
Select the subtree T' of T with the smallest expected cost
Missing Values
What is the problem?
During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)
But if a record r misses the value of the variable in the splitting attribute, r cannot participate further in tree construction
Algorithms for missing values address this problem
Mean and Mode Imputation
Assume record r has missing value r.X, and splitting variable is X.
Simplest algorithm:
If X is numerical (categorical), impute the overall mean (mode)
Improved algorithm:
If X is numerical (categorical), impute the mean(X|t.C) (the mode(X|t.C))
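Both variants can be sketched as one small function. The sketch assumes that t.C denotes the class label, so the improved algorithm conditions the mean or mode on records of the same class; the record layout, the `impute` name and the sample rows are hypothetical.

```python
from statistics import mean, mode

def impute(records, attr, class_attr=None):
    """Fill missing (None) values of attr with the mean (numeric) or mode
    (categorical); if class_attr is given, condition on the record's class."""
    def fill_value(rows):
        known = [r[attr] for r in rows if r[attr] is not None]
        numeric = all(isinstance(v, (int, float)) for v in known)
        return mean(known) if numeric else mode(known)

    out = []
    for r in records:
        if r[attr] is None:
            peers = records if class_attr is None else [
                p for p in records if p[class_attr] == r[class_attr]]
            r = {**r, attr: fill_value(peers)}   # copy, do not mutate input
        out.append(r)
    return out

ROWS = [
    {"age": 20, "cls": "good"},
    {"age": 30, "cls": "good"},
    {"age": 70, "cls": "bad"},
    {"age": None, "cls": "good"},
]
overall = impute(ROWS, "age")[3]["age"]           # mean of 20, 30, 70
per_class = impute(ROWS, "age", "cls")[3]["age"]  # mean of 20, 30
```

The class-conditional value is closer to the other "good" records, which is the motivation for the improved algorithm.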
Decision Trees: Summary
Many applications of decision trees
There are many algorithms available for:
Split selection
Pruning
Handling Missing Values
Data Access
Decision tree construction still active research area (after 20+ years!)
Challenges: Performance, scalability, evolving datasets, new applications
Supervised vs. Unsupervised Learning
Supervised
y=F(x): true function
D: labeled training set
D: {xi,F(xi)}
Learn: G(x): model trained to predict labels D
Goal: E[(F(x)-G(x))^2] ≈ 0
Well defined criteria: Accuracy, RMSE, ...
Unsupervised
Generator: true model
D: unlabeled data sample
D: {xi}
Learn
??????????
Goal:
??????????
Well defined criteria:
??????????
Clustering: Unsupervised Learning
Given:
Data Set D (training set)
Similarity/distance metric/information
Find:
Partitioning of data
Groups of similar/close items
Similarity?
Groups of similar customers
Similar demographics
Similar buying behavior
Similar health
Similar products
Similar cost
Similar function
Similar store
Similarity usually is domain/problem specific
Clustering: Informal Problem Definition
Input:
A data set of N records each given as a d-dimensional data feature vector.
Output:
Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:
High similarity of records within each cluster (intra-cluster similarity)
Low similarity of records between clusters (inter-cluster similarity)
Types of Clustering
Hard Clustering:
Each object is in one and only one cluster
Soft Clustering:
Each object has a probability of being in each cluster
Clustering Algorithms
Partitioning-based clustering
K-means clustering
K-medoids clustering
EM (expectation maximization) clustering
Hierarchical clustering
Divisive clustering (top down)
Agglomerative clustering (bottom up)
Density-Based Methods
Regions of dense points separated by sparser regions of relatively low density
K-Means Clustering Algorithm
Initialize k cluster centers
Do
Assignment step: Assign each data point to its closest cluster center
Re-estimation step: Re-compute cluster centers
While (there are still changes in the cluster centers)
Visualization at:
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
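The loop above might look as follows in plain Python. This is a minimal sketch, assuming squared Euclidean distance, caller-supplied initial centers and a cap on iterations; the `kmeans` name, the `max_iter` parameter and the sample points are illustrative.

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means on tuples of floats; `centers` is the initial guess."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(max_iter):
        # Assignment step: attach each point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Re-estimation step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # no change in the centers: stop
            break
        centers = new_centers
    return centers, clusters

POINTS = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers, final_clusters = kmeans(POINTS, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centers, which is one reason initialization matters for k-means.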
Issues
Why is K-Means working:
How does it find the cluster centers?
Does it find an optimal clustering?
What are good starting points for the algorithm?
What is the right number of cluster centers?
How do we know it will terminate?
Agglomerative Clustering
Algorithm:
Put each item in its own cluster (all singletons)
Find all pairwise distances between clusters
Merge the two closest clusters
Repeat until everything is in one cluster
Observations:
Results in a hierarchical clustering
Yields a clustering for each possible number of clusters
Greedy clustering: Result is not optimal for any cluster size
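The algorithm can be sketched directly from these steps. The slide does not say how the distance between two clusters is measured, so this sketch assumes single link (distance between the closest pair of members); the `agglomerative` name and the sample numbers are illustrative.

```python
def agglomerative(items, distance):
    """Single-link agglomerative clustering; returns the merge history,
    i.e. one clustering for each possible number of clusters."""
    clusters = [[x] for x in items]          # start with all singletons
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two closest clusters (single link: closest pair of members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
        history.append([list(c) for c in clusters])
    return history

levels = agglomerative([1, 2, 9, 10, 25], distance=lambda a, b: abs(a - b))
```

Each entry of `levels` is the clustering at one level of the hierarchy, from all singletons down to a single cluster.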
Density-Based Clustering
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of a point.
We can find clusters of arbitrary shape
Market Basket Analysis
Consider shopping cart filled with several items
Market basket analysis tries to answer the following questions:
Who makes purchases?
What do customers buy together?
In what order do customers purchase items?
Market Basket Analysis
Given:
A database of customer transactions
Each transaction is a set of items
Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4

Market Basket Analysis (Contd.)
Co-occurrences
80% of all customers purchase items X, Y and Z together.
Association rules
60% of all customers who purchase X and Y also buy Z.
Sequential patterns
60% of customers who first buy X also purchase Y within three weeks.
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
Confidence of a rule: X --> Y has confidence c if P(Y|X) = c
Support of a rule: X --> Y has support s if P(XY) = s
We can also define
Support of an itemset (a co-occurrence) XY:
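To make the two measures concrete, the sketch below computes them over the four transactions (TIDs 111 to 114) from the earlier example table, estimating probabilities as fractions of transactions. The function names are illustrative assumptions.

```python
TRANSACTIONS = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

s = support({"Pen", "Milk"})          # 3 of 4 transactions
c = confidence({"Ink"}, {"Juice"})    # 2 of the 3 Ink transactions
```

A rule is kept only if both values clear user-chosen thresholds.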
Market Basket Analysis: Applications
Sample Applications
Direct marketing
Fraud detection for medical insurance
Floor/shelf planning
Web site layout
Cross-selling
Applications of Frequent Itemsets
Market Basket Analysis
Association Rules
Classification (especially: text, rare classes)
Seeds for construction of Bayesian Networks
Web log analysis
Collaborative filtering
Association Rule Algorithms
More abstract problem redux
Breadth-first search
Depth-first search
Problem Redux
Abstract:
A set of items {1,2,...,k}
A database of transactions (itemsets) D={T1, T2, ..., Tn}, Tj subset {1,2,...,k}
GOAL: Find all itemsets that appear in at least x transactions
("appear in" == "are subsets of")
I subset T: T supports I
For an itemset I, the number of transactions it appears in is called the support of I.
Concrete:
I = {milk, bread, cheese, ...}
D = {{milk,bread,cheese}, {bread,cheese,juice}, ...}
GOAL: Find all itemsets that appear in at least 1000 transactions
{milk,bread,cheese} supports {milk,bread}
Problem Redux (Contd.)
Definitions:
An itemset is frequent if it is a subset of at least x transactions. (FI.)
An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)
GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
Obvious relationship: MFI subset FI
Example:
D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
{1,2} is frequent
{1,2,3} is maximal frequent
Support({1,2}) = 4
All maximal frequent itemsets: {1,2,3}
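The example can be checked with a brute-force sketch that enumerates every candidate itemset. This is only workable for tiny item sets and is not one of the breadth-first or depth-first algorithms the slides go on to mention; the function names are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration (fine for tiny examples only): map each
    frequent itemset to the number of transactions that contain it."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result

def maximal(frequent):
    """Keep only frequent itemsets with no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
FI = frequent_itemsets(D, min_support=3)
MFI = maximal(FI)
```

Here FI holds seven itemsets while MFI holds only {1,2,3}, which illustrates why MFI is a much more compact summary.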
Applications
Spatial association rules
Web mining
Market basket analysis
User/customer profiling
Extensions: Sequential Patterns
Suggestions
In the market itemset analysis, replace Milk, Pen, etc. with names of medications and use the idea in the new hospital data mining proposal.
To the idea of swarm intelligence, add the extra analysis of the induction rules in this set of slides.
Kraft Foods: Direct Marketing
Company maintains a large database of purchases by customers.
Data mining
1. Analysts identified associations among groups of products bought by particular segments of customers.
2. Sent out 3 sets of coupons to various households.
Better response rates: 50% increase in sales for one of its products
Continues to use this approach
Health Insurance Commission of Australia: Insurance Fraud
Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.
Data mining
1. Identified the practice of "up coding" to reflect more expensive tests than are necessary.
2. Now monitors orders for lab tests.
Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding."
HNC Software: Credit Card Fraud
Payment Fraud
Large issuers of cards may lose $10 million / year due to fraud
Difficult to identify the few transactions among thousands which reflect potential fraud
Falcon software
Mines data through neural networks
Introduced in September 1992
Models each cardholder's requested transaction against the customer's past spending history.
processes several hundred requests per second
compares current transaction with customer's history
identifies the transactions most likely to be frauds
enables bank to stop high-risk transactions before they are authorized
Used by many retail banks: currently monitors 160 million card accounts for fraud
New Account Fraud
Fraudulent applications for credit cards are growing at 50% per year
Falcon Sentry software
Mines data through neural networks and a rule base
Introduced in September 1992
Checks information on applications against data from credit bureaus
Allows card issuers to simultaneously:
increase the proportion of applications received
reduce the proportion of fraudulent applications authorized
Quality Control
IBM Microelectronics: Quality Control
Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.
Data mining
1. Built predictive models of
manufacturing yield (% non-defective)
effects of production parameters on chip performance.
2. Discovered critical factors behind
production yield &
product performance.
3. Created a new design for the chip
increased yield, which saved millions of dollars in direct manufacturing costs
enhanced product performance by substantially lowering the memory cycle time
Retail Sales
B & L Stores
Belk and Leggett Stores = one of largest retail chains
280 stores in southeast U.S.
data warehouse contains 100s of gigabytes (billion characters) of data
data mining to:
increase sales
reduce costs
Selected DSS Agent from MicroStrategy, Inc.
analyze merchandising (patterns of sales)
manage inventory
Market Basket Analysis
DSS Agent
uses intelligent agents for data mining
provides multiple functions
recognizes sales patterns among stores
discovers sales patterns by
time of day
day of year
category of product
etc.
swiftly identifies trends & shifts in customer tastes
performs Market Basket Analysis (MBA)
analyzes Point-of-Sale or -Service (POS) data
identifies relationships among products and/or services purchased
E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.
Agent tool is also used by other Fortune 1000 firms
average ROI > 300%
Case Based Reasoning (CBR)
[Figure: target case alongside precedent cases A and B]
General scheme for a case based reasoning (CBR) model. The target case is matched against similar precedents in the historical database, such as cases A and B.
Case Based Reasoning (CBR)
Learning through the accumulation of experience
Key issues
Indexing: storing cases for quick, effective access of precedents
Retrieval: accessing the appropriate precedent cases
Advantages
Explicit knowledge form recognizable to humans
No need to re-code knowledge for computer processing
Limitations
Retrieving precedents based on superficial features
E.g. Matching Indonesia with U.S. because both have similar population size
Traditional approach ignores the issue of generalizing knowledge
Genetic Algorithm
Generation of candidate solutions using the procedures of biological evolution.
Procedure
0. Initialize. Create a population of potential solutions ("organisms").
1. Evaluate. Determine the level of "fitness" for each solution.
2. Cull. Discard the poor solutions.
3. Breed.
a. Select 2 "fit" solutions to serve as parents.
b. From the 2 parents, generate offspring.
* Crossover: Cut the parents at random and switch the 2 halves.
* Mutation: Randomly change the value in a parent solution.
4. Repeat. Go back to Step 1 above.
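The procedure can be sketched as a toy genetic search over bit lists. Everything specific here is an assumption made for illustration: the bit-list encoding, the population size, culling to the fitter half, one-point crossover, and mutating the offspring (rather than a parent) by flipping a single bit.

```python
import random

def genetic_search(fitness, length=12, pop_size=20, generations=40, seed=0):
    """Toy GA over bit lists; all parameter values are illustrative."""
    rng = random.Random(seed)
    # 0. Initialize: a population of random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 1. Evaluate and 2. Cull: keep the fitter half.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # 3. Breed: refill the population from pairs of fit parents.
        children = []
        while len(survivors) + len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)          # crossover point
            child = mother[:cut] + father[cut:]
            flip = rng.randrange(length)            # mutation: flip one bit
            child[flip] = 1 - child[flip]
            children.append(child)
        population = survivors + children           # 4. Repeat
    return max(population, key=fitness)

best = genetic_search(fitness=sum)   # fitness = number of 1 bits
```

Because the fittest solutions always survive the cull, the best fitness in the population never decreases from one generation to the next.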
Genetic Algorithm (Cont.)
Advantages
Applicable to a wide range of problem domains.
Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy.
Implicit parallelism: can search in many directions concurrently.
Limitations
Slow, like neural networks.
But: computation can be distributed over multiple processors (unlike neural networks)
Source: www.pathology.washington.edu
Multistrategy Learning
Every technique has advantages & limitations
Multistrategy approach
Take advantage of the strengths of diverse techniques
Circumvent the limitations of each methodology
Types of Models
Prediction Models for Predicting and Classifying
Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
Clustering/Grouping algorithms: K-means, Kohonen
Association algorithms: apriori, GRI
Neural Networks
Description
Difficult interpretation
Tends to overfit the data
Extensive amount of training time
A lot of data preparation
Works with all data types
Rule Induction
Description
Intuitive output
Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm a special case of rule induction
Apriori
Description
Seeks association rules in dataset
Market basket analysis
Sequence discovery
Data Mining Is
The automated process of finding relationships and patterns in stored data
It is different from the use of SQL queries and other business intelligence tools
Data Mining Is
Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
Enabled by data warehousing, parallel processing, and data mining algorithms
Common Types of Information from Data Mining
Associations -- identifies occurrences that are linked to a single event
Sequences -- identifies events that are linked over time
Classification -- recognizes patterns that describe the group to which an item belongs
Common Types of Information from Data Mining
Clustering -- discovers different groupings within the data
Forecasting -- estimates future values
Commonly Used Data Mining Techniques
Artificial neural networks
Decision trees
Genetic algorithms
Nearest neighbor method
Rule induction
The Current State of Data Mining Tools
Many of the vendors are small companies
IBM and SAS have been in the market for some time, and more "biggies" are moving into this market
BI tools and RDBMS products are increasingly including basic data mining capabilities
Packaged data mining applications are becoming common
The Data Mining Process
Requires personnel with domain, data warehousing, and data mining expertise
Requires data selection, data extraction, data cleansing, and data transformation
Most data mining tools work with highly granular flat files
Is an iterative and interactive process
Why Data Mining
Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Identify likely responders to sales promotions
Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Customer relationship management:
Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information
Applications
Banking: loan/credit card approval
predict good customers based on old customers
Customer relationship management:
identify those who are likely to leave for a competitor.
Targeted marketing:
identify likely responders to promotions
Fraud detection: telecommunications, financial transactions
from an online stream of events, identify fraudulent events
Manufacturing and production:
automatically adjust knobs when process parameter changes
Applications (continued)
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis:
identify new galaxies by searching for subclusters
Web site/store design and promotion:
find affinity of visitor to pages and modify
The KDD process
Problem formulation
Data collection
subset data: sampling might hurt if highly skewed data
feature selection: principal component analysis, heuristic search
Pre-processing: cleaning
name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
Transformation:
map complex objects e.g. time series data to features e.g. frequency
Choosing mining task and mining method:
Result evaluation and Visualization:
Knowledge discovery is an iterative process
Relationship with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on
scalability of number of features and instances
stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning.
automation for handling large, heterogeneous data
Some basic operations
Predictive:
Regression
Classification
Collaborative Filtering
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Classification
Given old data about customers and payments, predict new applicant's loan eligibility.
[Figure: previous customers (Age, Salary, Profession, Location, Customer type) feed a classifier that produces decision rules (Salary > 5 L, Prof. = Exec); the rules label a new applicant's data as good/bad]
Classification methods
Goal: Predict class Ci = f(x1, x2, .. xn)
Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci.
Nearest neighbour
Decision tree classifier: divide decision space into piecewise constant regions.
Probabilistic/generative models
Neural networks: partition by non-
Nearest neighbor
Define proximity between instances, find neighbors of new instance and assign majority class
Case based reasoning: when attributes are more complicated than real-valued.
Cons
Slow during application.
No feature selection.
Notion of proximity vague
Pros
+ Fast training
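A minimal sketch of the nearest neighbor idea follows. It assumes Euclidean distance as the notion of proximity and k=3 neighbors; the `knn_predict` name and the (age, salary) training rows are made up for the example.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Majority class among the k training instances closest to `query`.
    Euclidean distance is an assumed notion of proximity."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (age, salary) instances with a good/bad credit label.
TRAIN = [
    ((25, 2.0), "bad"), ((30, 2.5), "bad"), ((28, 3.0), "bad"),
    ((45, 9.0), "good"), ((50, 8.0), "good"), ((48, 9.5), "good"),
]
label = knn_predict(TRAIN, (47, 8.5))
```

"Training" is just storing the data, which is why it is fast, while every prediction scans the whole training set, which is why application is slow.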
Clustering
Unsupervised learning when old data with class labels not available e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.
Key requirement: Need a good measure of similarity between instances.
Identify micro-markets and develop policies for each
Applications
Customer segmentation e.g. for targeted marketing
Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.
Identify micro-markets and develop policies for each
Collaborative filtering:
group based on common items purchased
Text clustering
Compression
Distance functions
Numeric data: Euclidean, Manhattan distances
Categorical data: 0/1 to indicate presence/absence followed by
Hamming distance (# dissimilarity)
Jaccard coefficients: # similarity in 1s / (# of 1s)
data dependent measures: similarity of A and B depends on co-occurrence with C.
Combined numeric and categorical data:
weighted normalized distance:
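The first four measures can be written out as short functions. The sketch reads the Jaccard coefficient in its usual form (shared 1s divided by positions where either vector has a 1); the function names and the 0/1 encoding of categorical data are assumptions for the example.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0
```

For instance, the 0/1 vectors (1,0,1,1) and (1,1,0,1) disagree in two positions and share two of the four positions where either has a 1.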
Clustering methods
Hierarchical clustering
agglomerative vs. divisive
single link vs. complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based:
Partitional methods: K-means
Criteria: minimize sum of square of distance
Between each point and centroid of the cluster.
Between each pair of points in the cluster
Algorithm:
Select initial partition with K clusters: random, first K, K separated points
Repeat until stabilization:
Assign each point to closest cluster center
Collaborative Filtering
Given database of user preferences, predict preference of new user
Example: predict what new movies you will like based on
your past preferences
others with similar past preferences
their preferences for the new movies
Example: predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer)
Association rules
Given set T of groups of items
Example: set of item sets purchased
Goal: find all rules on itemsets of the form a-->b such that
support of a and b > user threshold s
conditional probability (confidence) of b given a > user threshold c
Example: Milk --> bread
Purchase of product A --> ...
T:
Milk, cereal
Tea, milk
Tea, rice, bread
cereal
Prevalent ≠ Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena
[Figure: in 1995, "Milk and cereal sell together!"; in 1998, "Zzzz... Milk and cereal sell together!"]
Applications of fast itemset counting
Find correlated events:
Applications in medicine: find redundant tests
Cross selling in retail, banking
Improve predictive capability of classifiers that assume attribute independence
New similarity measures of categorical attributes [Mannila et al,
Application Areas
Industry                 Application
Finance                  Credit Card Analysis
Insurance                Claims, Fraud Analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data Service providers   Value added data
Utilities                Power usage analysis
Usage scenarios
Data warehouse mining:
assimilate data from operational sources
mine static data
Mining log data
Continuous mining: example in process control
Stages in mining:
data selection -> pre-processing: cleaning -> transformation -> mining -> result evaluation -> visualization
Mining market
Around 20 to 30 mining tool vendors
Major tool players:
Clementine,
IBM's Intelligent Miner,
SGI's MineSet,
SAS's Enterprise Miner.
All pretty much the same set of tools
Many embedded products:
fraud detection:
electronic commerce applications,
health care,
customer relationship management: Epiphany
Vertical integration:
Mining on the web
Web log analysis for site design:what are popular pages,
what links are hard to find.
Electronic stores sales enhancements:recommendations, advertisement:
Collaborative filtering: Net perception,Wisewire
Inventory control: what was a shopperlooking for and could not find..
State of art in mining OLAP integration
Decision trees [Information discovery, Cognos]
find factors influencing high profits
Clustering [Pilot software]
segment customers to define hierarchy on that dimension
Time series analysis: [Seagate's Holos]
Query for various shapes along time: e.g. spikes, outliers
Multi-level Associations [Han et al.]
find associations between members of dimensions
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Target Marketing
Holding on to Good Customers
Weeding out Bad Customers
Some success stories
Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data
Won over (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process
Major US bank: customer attrition prediction
First segment customers based on financial behavior: found 3 segments
Build attrition models for each of the 3 segments
40-50% of attritions were predicted == factor of 18 increase
Targeted credit marketing: major US banks
find customer segments based on 13 months credit balances
What is KnowledgeSeeker?
Produced by ANGOSS Software Corporation, who focus solely on data mining software.
Offer training and consulting services
Produce data mining add-ins which accept data from all major databases
Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.
Major Competitors
Company    Software
SPSS       Clementine 6.0
SAS        Enterprise Miner 3.0
IBM        Intelligent Miner
SGI        Mineset 3.1
Oracle     Darwin
Cognos     Scenario
Current Applications
Manufacturing
Used by the R.R. Donnelley & Sons commercial printing company to improve process control, cut costs and increase productivity.
Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality and to generate rules for production control systems.
Current Applications
Auditing
Used by the IRS to combat fraud, reduce risk, and increase collection rates.
Finance
Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.
Current Applications
CRM
Telephony
Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.
Current Applications
Marketing
Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.
Health Care
Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.
Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.
More Customers
http://washpost.com/
http://www.glaxowellcome.com/
http://www.aig.com/
http://www.sbc.com/
http://www.microsoft.com/
http://www.ameritrade.com/
http://www.chase.com/
http://www.pacbell.com/
http://www.generalelectric.com/
http://www.texaco.com/
http://www.pfizer.com/
http://www.bankofamerica.com/
http://www.allstate.com/
Questions
1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker who has low to moderate salt consumption?
2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?
3. If you are a 2% milk drinker, how many factors are still interesting?
4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?
5. Grow an automatic tree. Is gender an interesting factor for a 55-year-old regular smoker who does not eat cheese?
Association
Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction.
This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products.
Example: 80 percent of all transactions in which beer was purchased also included potato chips.
Sequence-based analysis
Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction.
Sequence-based analysis ties a customer's purchases together over time, to identify a typical set of purchases that might predict the subsequent purchase of a specific item.
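One way to picture sequence-based analysis is to measure how often a set of earlier purchases is followed by a later purchase of a target item. The sketch below assumes each customer's history is an ordered list of baskets. The customer ids, the items, and the helper names (`follows`, `sequence_confidence`) are all hypothetical and chosen only for illustration.

```python
def follows(history, antecedent, target):
    """True if `target` is bought in a visit after all of `antecedent` has been bought."""
    seen = set()
    for basket in history:
        if antecedent <= seen and target in basket:
            return True
        seen |= basket
    return False

def sequence_confidence(histories, antecedent, target):
    """Among customers who bought every antecedent item, the share who bought `target` later."""
    antecedent = set(antecedent)
    buyers = [h for h in histories.values() if antecedent <= set().union(*h)]
    if not buyers:
        return 0.0
    return sum(follows(h, antecedent, target) for h in buyers) / len(buyers)

# Hypothetical purchase histories: one time-ordered list of baskets per customer.
histories = {
    "c1": [{"printer"}, {"paper"}, {"ink"}],
    "c2": [{"printer", "paper"}, {"ink"}],
    "c3": [{"ink"}, {"printer", "paper"}],   # ink came first, so it does not count
    "c4": [{"printer"}, {"cable"}],          # never bought paper
}

score = sequence_confidence(histories, {"printer", "paper"}, "ink")
print(f"printer + paper, then ink: {score:.2f}")
```

Unlike the point-in-time market-basket case, the order of visits matters here: customer c3 bought all three items but is not counted as a hit.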
Clustering
Clustering approaches address segmentation problems.
These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."
Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
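As a rough illustration of assigning records to a small set of segments, here is a minimal k-means sketch. The slides do not say which clustering algorithm the tools use, so k-means is an assumption. The customer records (visits per month, average spend) are invented, and real records would have far more attributes.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (centroids, labels); starts from the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its records.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical customers: (store visits per month, average spend per visit).
customers = [(2, 15), (3, 20), (2, 18), (12, 95), (11, 110), (13, 100)]

centroids, labels = kmeans(customers, k=2)
for customer, segment in zip(customers, labels):
    print(customer, "-> segment", segment)
```

Starting from the first k points keeps the run deterministic. Practical implementations usually choose random or spread-out starting centroids and scale the attributes first.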
Classification
Most commonly applied data mining technique
Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.
Example: A classifier derived from the Classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
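To make "determine the set of parameters from preclassified examples" concrete, the sketch below learns a single parameter, a cut-off on one numeric attribute, from labelled loans. The attribute (debt-to-income ratio), the data, and the function names are assumptions made for illustration. Real classifiers such as decision trees learn many such parameters.

```python
def fit_threshold(examples):
    """Pick the cut-off on one numeric attribute that misclassifies the fewest examples.

    examples: list of (value, is_risky) pairs with at least two distinct values.
    The resulting rule predicts "risky" when value > threshold.
    """
    values = sorted({v for v, _ in examples})
    # Candidate cut-offs: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum((v > t) != risky for v, risky in examples)

    return min(candidates, key=errors)

def predict(threshold, value):
    """Apply the learned rule to a new applicant."""
    return value > threshold

# Hypothetical preclassified loans: (debt-to-income ratio, turned out risky?).
loans = [
    (0.10, False), (0.15, False), (0.22, False), (0.30, False),
    (0.35, True), (0.41, True), (0.55, True),
]

threshold = fit_threshold(loans)
print(f"learned cut-off: {threshold:.3f}")
print("applicant at 0.50 flagged as risky:", predict(threshold, 0.50))
```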
Issues of Data Mining
Present-day tools are strong but require significant expertise to implement effectively.
Susceptibility to "dirty" or irrelevant data.
Inability to "explain" results in human terms.
Issues
Susceptibility to "dirty" or irrelevant data
Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions.
Users must take the necessary precautions to ensure that the data being analyzed is "clean."
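One simple precaution is a validation pass that sets aside implausible records before any mining is done. The sketch below is only an example of the idea: the field names (`age`, `income`) and the plausibility limits are assumptions, and real checks depend on the data set.

```python
def is_clean(record):
    """Hypothetical sanity rules: age and income must be present and plausible."""
    age, income = record.get("age"), record.get("income")
    return (
        isinstance(age, (int, float)) and 0 <= age <= 120
        and isinstance(income, (int, float)) and income >= 0
    )

# Invented records, three of which a mining tool would otherwise accept as fact.
records = [
    {"age": 34, "income": 52000},
    {"age": -1, "income": 48000},    # impossible age
    {"age": 41, "income": None},     # missing income
    {"age": 29, "income": 61000},
    {"age": 430, "income": 39000},   # probable typing error
]

clean = [r for r in records if is_clean(r)]
rejected = [r for r in records if not is_clean(r)]
print(f"{len(clean)} clean, {len(rejected)} set aside for review")
```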
Issues, cont
Inability to "explain" results in human terms
Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.
What good does the information do if you don't understand it?
Comparison with reporting, BI and OLAP
Reporting (also applies to visualisation & simple statistics)
Simple relationships
Choose the relevant factors
Examine all details
Data Mining
Complex relationships
Automatically find the relevant factors
Show only relevant details
Prediction
Comparison with Statistics
Statistical analysis
Mainly about hypothesis testing
Focussed on precision
Data mining
Mainly about hypothesis generation
Focussed on deployment
Example: data mining and customer processes
Insight: Who are my customers and why do they behave the way they do?
Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?
Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites
Example: data mining and fraud detection
Insight: How can (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
Prediction: Recognise similarity to previous frauds: how similar?
Spot abnormal events: how suspicious?
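The question "how suspicious?" can be given a number by scoring each new event against what is normal for the account. The sketch below uses a simple z-score on transaction amounts. This is an assumed technique for illustration, not the method of any tool named in these slides, and the amounts are invented.

```python
from statistics import mean, pstdev

def suspicion_scores(history, new_events):
    """Score each new amount by its distance from the historical mean, in standard deviations."""
    mu = mean(history)
    sigma = pstdev(history)
    return [abs(x - mu) / sigma if sigma else 0.0 for x in new_events]

# Hypothetical past card transactions for one customer, and two new ones to score.
history = [20, 25, 22, 30, 18, 27, 24, 26]
new_events = [23, 400]

scores = suspicion_scores(history, new_events)
for amount, score in zip(new_events, scores):
    print(f"amount {amount}: {score:.1f} standard deviations from normal")
```

A score near zero means the event resembles past behaviour, while a large score marks it as abnormal. Deciding where "abnormal" becomes "suspicious" is still a judgement for the analyst.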
Example: data mining and diagnosing cancer
Complex data from genetics
Challenging data mining problem
Find patterns of gene activation indicating different diseases / stages
"Changed the way I think about cancer" (Oncologist from Chicago Children's Memorial Hospital)
Example: data mining and policing
Knowing the patterns helps plan effective crime prevention
Crime hot-spots understood better
Sift through mountains of crime reports
Identify crime series
"Other people save money using data mining; we save lives." (Police force homicide specialist and data miner)
Data mining tools: Clementine and its philosophy
How to do data mining
Lots of data mining operations
How do you glue them together to solve a problem?
How do we actually do data mining?
Methodology
Not just the right way, but any way
Myths about Data Mining (1): Data, Process and Tech
Myth: Data mining is all about massive data
It can be, but some important datasets are very small, and sampling is often appropriate
Myth: Data mining is a technical process
Business analysts perform data mining every day
It is a business process
Myth: Data mining is all about algorithms
Algorithms are a key tool
But data mining is done by people, not by algorithms
Myth: Data mining is all about predictive accuracy
It's about usefulness
Accuracy is only a small component
Myths about Data Mining (2): Data Quality
Myth: Data mining only works with clean data
Cleaning the data is part of the data mining process
Need not be clean initially
Myth: Data mining only works with complete data
Data mining works with whatever data you have.
Complete is good, incomplete is also ok.
Myth: Data mining only works with correct