
1/301

Data Mining Tools

Overview & Tutorial

Ahmed Sameh

Prince Sultan University

Department of Computer Science & Information Systems

May 2010 (Some slides belong to IBM)

1


2/301

2

Introduction Outline

Define data mining

Data mining vs. databases

Basic data mining tasks

Data mining development

Data mining issues

Goal: Provide an overview of data mining.


3/301

3

Introduction

Data is growing at a phenomenal rate

Users expect more sophisticated information

How?

UNCOVER HIDDEN INFORMATION

DATA MINING


4/301

4

Data Mining Definition

Finding hidden information in a database

Fit data to a model

Similar terms

Exploratory data analysis

Data driven discovery

Deductive learning


5/301

5

Data Mining Algorithm

Objective: Fit Data to a Model
  Descriptive
  Predictive

Preference: Technique to choose the best model

Search: Technique to search the data
  Query


6/301

6

Database Processing vs. Data Mining Processing

Database processing
  Query: well defined, expressed in SQL
  Data: operational data
  Output: precise, a subset of the database

Data mining processing
  Query: poorly defined, no precise query language
  Data: not operational data
  Output: fuzzy, not a subset of the database


7/301

7

Query Examples

Database
  Find all credit applicants with last name of Smith.
  Identify customers who have purchased more than $10,000 in the last month.
  Find all customers who have purchased milk.

Data Mining
  Find all credit applicants who are poor credit risks. (classification)
  Identify customers with similar buying habits. (clustering)
  Find all items which are frequently purchased with milk. (association rules)
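
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the transaction records and names below are invented, and the "mining" side is reduced to simple co-occurrence counting rather than a full association rule algorithm.

```python
from collections import Counter

# Hypothetical purchase records (not from the slides): one item set per transaction.
transactions = [
    {"customer": "Ann", "items": {"milk", "bread", "eggs"}},
    {"customer": "Bob", "items": {"milk", "bread"}},
    {"customer": "Cho", "items": {"beer", "chips"}},
    {"customer": "Dee", "items": {"milk", "cereal", "bread"}},
]

# Database-style query: a precise condition whose answer is a subset of the stored rows.
milk_buyers = [t["customer"] for t in transactions if "milk" in t["items"]]

# Data-mining-style question: which items tend to be purchased together with milk?
# The answer is a derived summary, not a subset of the database.
co_purchases = Counter()
for t in transactions:
    if "milk" in t["items"]:
        co_purchases.update(t["items"] - {"milk"})

print(milk_buyers)
print(co_purchases.most_common(1))
```

The first result could come straight from an SQL WHERE clause; the second is a pattern that has to be computed across many rows.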


8/301

8

Related Fields

Statistics

Machine Learning

Databases

Visualization

Data Mining and Knowledge Discovery


9/301

9

Statistics, Machine Learning and Data Mining

Statistics
  more theory-based
  more focused on testing hypotheses

Machine learning
  more heuristic
  focused on improving performance of a learning agent
  also looks at real-time learning and robotics, areas not part of data mining

Data Mining and Knowledge Discovery
  integrates theory and heuristics
  focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results

Distinctions are fuzzy


10/301

Definition

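A class of database applications that analyze data in a database using tools which look for trends or anomalies.

IBM was an early commercial pioneer of data mining; the underlying techniques grew out of statistics, machine learning, and database research.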


11/301

Purpose

To look for hidden patterns or previously

unknown relationships among the data in a

group of data that can be used to predict future

behavior.

Ex: Data mining software can help retail

companies find customers with common

interests.


12/301

Background Information

Many of the techniques used by today's data

mining tools have been around for many years,

having originated in the artificial intelligence

research of the 1980s and early 1990s.

Data Mining tools are only now being applied

to large-scale database systems.


13/301

The Need for Data Mining

The amount of raw data stored in corporate

data warehouses is growing rapidly.

There is too much data and complexity that might be relevant to a specific problem.

Data mining promises to bridge the analytical

gap by giving knowledge workers the tools to

navigate this complex analytical space.


14/301

The Need for Data Mining, cont

The need for information has resulted in the

proliferation of data warehouses that integrate

information from multiple sources to support

decision making.

Often include data from external sources, such

as customer demographics and household

information.


15/301

Definition (Cont.)

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Valid: The patterns hold in general.

Novel: We did not know the pattern beforehand.

Useful: We can devise actions from the patterns.

Understandable: We can interpret and comprehend the patterns.


16/301

Of Laws, Monsters, and Giants

Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)

Its more aggressive cousin: disk storage capacity doubles every 9 months

[Figure: Disk TB Shipped per Year, 1988-2000, log scale from 1E+3 to 1E+7 TB with an ExaByte level marked. Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), www.disktrend.com]

What do the two laws combined

produce?

A rapidly growing

gap between our

ability to generate

data, and our ability
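
The doubling periods and the annual growth percentages quoted on this slide are related by simple arithmetic, sketched below in Python. The function names are illustrative, not from the slides. Under this conversion an 18-month doubling period gives about 58.7% per year, matching the chart, while the chart's 112% per year corresponds to a doubling period of roughly 11 months rather than exactly 9.

```python
import math

def annual_growth(months: float) -> float:
    """Annual growth rate (as a fraction) implied by doubling every `months` months."""
    return 2 ** (12 / months) - 1

def months_to_double(annual_rate: float) -> float:
    """Doubling period in months implied by an annual growth rate given as a fraction."""
    return 12 * math.log(2) / math.log(1 + annual_rate)

moore_rate = annual_growth(18)          # Moore's law: doubling every 18 months
disk_rate = annual_growth(9)            # disk capacity: doubling every 9 months
chart_doubling = months_to_double(1.12) # the chart's 112%/y disk growth figure

print(f"Moore's law: {moore_rate:.1%} per year")
print(f"9-month doubling: {disk_rate:.1%} per year")
print(f"112%/y means doubling every {chart_doubling:.1f} months")
```

Either way, storage grows much faster than processing capacity, which is the gap the slide points to.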


17/301

What is Data Mining?

Finding interesting structure in data

Structure: refers to statistical patterns, predictive models, hidden relationships

Examples of tasks addressed by Data Mining

Predictive Modeling (classification, regression)

Segmentation (Data Clustering)

Summarization


18/301


19/301

19

Major Application Areas for Data Mining Solutions

Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web


20/301

20

Data Mining

The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
  Extremely large datasets
  Discovery of the non-obvious
  Useful knowledge that can improve processes
  Cannot be done manually

Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.

Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.


21/301

21

Data Mining (cont.)


22/301

22

Data Mining (cont.)

Data Mining is a step of the Knowledge Discovery in Databases (KDD) process
  Data Warehousing
  Data Selection
  Data Preprocessing
  Data Transformation
  Data Mining
  Interpretation/Evaluation

Data Mining is sometimes referred to as KDD, and DM and KDD tend to be used as synonyms


23/301

23

Data Mining Evaluation


24/301

24

Data Mining is Not

Data warehousing

SQL / Ad Hoc Queries / Reporting

Software Agents

Online Analytical Processing (OLAP)

Data Visualization


25/301

25

Data Mining Motivation

Changes in the Business Environment

Customers becoming more demanding

Markets are saturated

Databases today are huge:
  More than 1,000,000 entities/records/rows
  From 10 to 10,000 fields/attributes/variables
  Gigabytes and terabytes

Databases are growing at an unprecedented rate

Decisions must be made rapidly

Decisions must be made with maximum knowledge


26/301

Why Use Data Mining Today?

Human analysis skills are inadequate:

Volume and dimensionality of the data

High data growth rate

Availability of:

Data

Storage

Computational power

Off-the-shelf software

Expertise


27/301

An Abundance of Data

Supermarket scanners, POS data

Preferred customer cards

Credit card transactions

Direct mail response

Call center records

ATM machines

Demographic data

Sensor networks

Cameras

Web server logs

Customer web site trails


28/301

Evolution of Database Technology

1960s: IMS, network model
1970s: The relational data model, first relational DBMS implementations
1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
2000s: High availability, zero-administration, seamless integration into business processes
2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???


29/301

Much Commercial Support

Many data mining tools

http://www.kdnuggets.com/software

Database systems with data mining support

Visualization tools

Data mining process support

Consultants


30/301

Why Use Data Mining Today?

Competitive pressure!

"The secret of success is to know something that nobody else knows."
Aristotle Onassis

Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)

Personalization, CRM

The real-time enterprise

Systemic listening

Security, homeland defense


31/301

The Knowledge Discovery Process

Steps:

1. Identify business problem

2. Data mining

3. Action

4. Evaluation and measurement

5. Deployment and integration into business processes


32/301

Data Mining Step in Detail

2.1 Data preprocessing
  Data selection: identify target datasets and relevant fields
  Data cleaning
    Remove noise and outliers
  Data transformation
    Create common units
    Generate new fields

2.2 Data mining model construction

2.3 Model evaluation
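
The preprocessing sub-steps in 2.1 can be sketched as a small Python function. Everything specific here is an assumption made for illustration: the record fields, the plausibility range used to drop outliers, and the derived BMI field are invented, not taken from the slides.

```python
# Hypothetical raw records; field names and thresholds are illustrative only.
raw_records = [
    {"id": 1, "weight": 70.0, "unit": "kg", "height_m": 1.75, "notes": "n/a"},
    {"id": 2, "weight": 154.0, "unit": "lb", "height_m": 1.60, "notes": "vip"},
    {"id": 3, "weight": 9000.0, "unit": "kg", "height_m": 1.80, "notes": ""},  # implausible value
    {"id": 4, "weight": 80.0, "unit": "kg", "height_m": 1.90, "notes": ""},
]

LB_TO_KG = 0.45359237

def preprocess(records):
    cleaned = []
    for r in records:
        # Data selection: keep only the fields relevant to the analysis.
        weight, unit, height = r["weight"], r["unit"], r["height_m"]
        # Data transformation: create common units (everything in kilograms).
        weight_kg = weight * LB_TO_KG if unit == "lb" else weight
        # Data cleaning: drop records outside an assumed plausible range.
        if not 20.0 <= weight_kg <= 300.0:
            continue
        # Data transformation: generate a new field from existing ones.
        bmi = weight_kg / height ** 2
        cleaned.append({"id": r["id"], "weight_kg": round(weight_kg, 1), "bmi": round(bmi, 1)})
    return cleaned

cleaned = preprocess(raw_records)
print(cleaned)
```

The output of a step like this is what model construction (2.2) would consume.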


33/301

Preprocessing and Mining

[Figure: Original Data -> (Data Integration and Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Model Construction) -> Patterns -> (Interpretation) -> Knowledge]


34/301

34

Data Mining Techniques


Descriptive Predictive

Clustering

Association

Classification

Regression

Sequential Analysis

Decision Tree

Rule Induction

Neural Networks

Nearest Neighbor Classification


35/301

35

Data Mining Models and Tasks


36/301

36

Basic Data Mining Tasks

Classification maps data into predefined groups or classes
  Supervised learning
  Pattern recognition
  Prediction

Regression is used to map a data item to a real valued prediction variable.

Clustering groups similar data together into clusters.
  Unsupervised learning
  Segmentation
  Partitioning
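
As a concrete instance of classification, here is a minimal nearest-neighbor sketch in Python. The training examples, the two features, and the class labels are hypothetical; a real credit-risk model would use far more data and careful feature scaling.

```python
import math

# Hypothetical labelled examples: (income, debt) in $1000s mapped to a predefined class.
training = [
    ((60.0, 5.0), "good"),
    ((75.0, 10.0), "good"),
    ((20.0, 30.0), "poor"),
    ((25.0, 40.0), "poor"),
]

def classify(point):
    """Return the class of the closest training example (1-nearest neighbor)."""
    nearest = min(training, key=lambda example: math.dist(example[0], point))
    return nearest[1]

print(classify((70.0, 8.0)))
print(classify((22.0, 35.0)))
```

This is supervised learning because the classes are fixed in advance by the labelled examples; clustering would instead have to discover the groups without any labels.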


37/301

37

Basic Data Mining Tasks (cont'd)

Summarization maps data into subsets with associated simple descriptions.
  Characterization
  Generalization

Link Analysis uncovers relationships among data.
  Affinity Analysis
  Association Rules

Sequential Analysis determines sequential patterns.
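
Association rules are usually scored with support and confidence. These are standard measures brought in here for illustration (the slides do not define them), and the baskets below are invented.

```python
# Hypothetical market baskets.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"beer"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "bread"}))
print(confidence({"milk"}, {"bread"}))
```

A rule such as "milk implies bread" is typically reported only when both numbers exceed thresholds chosen by the analyst.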


38/301

38

Ex: Time Series Analysis

Example: Stock Market

Predict future values

Determine similar patterns over time

Classify behavior


39/301

39

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.


40/301

40

Data Mining Development

Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Bayes' Theorem
Regression Analysis
EM Algorithm
K-Means Clustering
Time Series Analysis
Neural Networks
Decision Tree Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques


41/301

41

KDD Issues

Human Interaction

Overfitting

Outliers

Interpretation

Visualization

Large Datasets

High Dimensionality


42/301

42

KDD Issues (cont'd)

Multimedia Data

Missing Data

Irrelevant Data

Noisy Data

Changing Data

Integration

Application


43/301

43

Visualization Techniques

Graphical

Geometric

Icon-based

Pixel-based

Hierarchical

Hybrid


44/301

44

Data Mining Applications

Data Mining Applications:


45/301

45

Data Mining Applications: Retail

Performing basket analysis
  Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.

Sales forecasting
  Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?

Database marketing
  Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.

Merchandise planning and allocation
  When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic



46/301

46

Data Mining Applications: Banking

Card marketing
  By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.

Cardholder pricing and profitability
  Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.

Fraud detection
  Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.

Predictive life-cycle management
  DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).



47/301

47

Data Mining Applications: Telecommunication

Call detail record analysis
  Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.

Customer loyalty
  Some customers repeatedly switch providers, or "churn," to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.



48/301

48

Data Mining Applications: Other Applications

Customer segmentation
  All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.

Manufacturing
  Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.

Warranties
  Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.

Frequent flier incentives
  Airlines can identify groups of customers that can be given incentives to fly more.


49/301

49

A producer wants to know ...

Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?


50/301

50

Data, Data everywhere yet ...

I can't find the data I need
  data is scattered over the network
  many versions, subtle differences

I can't get the data I need
  need an expert to get the data

I can't understand the data I found
  available data poorly documented

I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to other


51/301

51

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]


52/301

52

What are the users saying...

Data should be integrated across the enterprise

Summary data has a real value to the organization

Historical data holds the key to understanding data over time

What-if capabilities are required


53/301

53

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

[Figure: Data -> Information]


54/301

54

Very Large Data Bases

Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos


55/301

55

Data Warehousing -- It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible

A decision support database maintained separately from the organization's operational database


56/301

56

Data Warehouse

A data warehouse is a

subject-oriented

integrated

time-varying

non-volatile

collection of data that is used primarily in

organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996


57/301

Data Warehousing Concepts

Decision support is key for companies wanting to turn their organizational data into an information asset

A traditional database is transaction-oriented, while a data warehouse is data-retrieval optimized for decision support

Data Warehouse: "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"

OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications


58/301

What does data warehouse do?

integrate diverse information from various systems, which enables users to quickly produce powerful ad-hoc queries and perform complex analysis

create an infrastructure for reusing the data in numerous ways

create an open systems environment to make useful information easily accessible to authorized users

help managers make informed decisions


59/301

Benefits of Data Warehousing

Potential high returns on investment

Competitive advantage

Increased productivity of corporate decision-makers


60/301

Comparison of OLTP and Data Warehousing

OLTP systems | Data warehousing systems
Holds current data | Holds historic data
Stores detailed data | Stores detailed, lightly, and highly summarized data
Data is dynamic | Data is largely static
Repetitive processing | Ad hoc, unstructured, and heuristic processing
High level of transaction throughput | Medium to low transaction throughput
Predictable pattern of usage | Unpredictable pattern of usage
Transaction driven | Analysis driven
Application oriented | Subject oriented
Supports day-to-day decisions | Supports strategic decisions
Serves large number of clerical / operational users | Serves relatively lower number of managerial users


61/301

Data Warehouse Architecture

Operational Data
Load Manager
Warehouse Manager
Query Manager
Detailed Data
Lightly and Highly Summarized Data
Archive / Backup Data
Meta-Data
End-user Access Tools


62/301

End-user Access Tools

Reporting and query tools

Application development tools

Executive Information System (EIS) tools

Online Analytical Processing (OLAP) tools

Data mining tools


63/301

Data Warehousing Tools and Technologies

Extraction, Cleansing, and Transformation Tools

Data Warehouse DBMS
  Load performance
  Load processing
  Data quality management
  Query performance
  Terabyte scalability
  Networked data warehouse
  Warehouse administration
  Integrated dimensional tools
  Advanced query functionality


64/301

Data Marts

A subset of a data warehouse that supports the requirements of a particular department or business function


65/301

Online Analytical Processing (OLAP)

OLAP

The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data

Multi-dimensional OLAP

Cubes of data

[Figure: data cube with dimensions Time, City, and Product type]
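
A data cube over Time, City, and Product type can be pictured as a mapping from coordinate tuples to a measure. The sketch below is one possible in-memory representation, with made-up sales figures; real OLAP servers use specialized storage, so this only illustrates the idea of consolidation along dimensions.

```python
from collections import defaultdict

# Hypothetical fact records: (time, city, product type, sales amount).
facts = [
    ("2010-Q1", "Riyadh", "Juice", 100),
    ("2010-Q1", "Riyadh", "Milk", 150),
    ("2010-Q1", "Jeddah", "Juice", 80),
    ("2010-Q2", "Riyadh", "Juice", 120),
    ("2010-Q2", "Jeddah", "Milk", 60),
]

# Build the cube: one cell per (time, city, product) coordinate.
cube = defaultdict(int)
for time, city, product, amount in facts:
    cube[(time, city, product)] += amount

def consolidate(cube, keep):
    """Sum cells over every dimension not listed in `keep` (0=time, 1=city, 2=product)."""
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[i] for i in keep)] += value
    return dict(result)

by_city = consolidate(cube, keep=(1,))
by_time_product = consolidate(cube, keep=(0, 2))
print(by_city)
print(by_time_product)
```

Each call answers a different analytical question from the same cells, which is what makes the cube view convenient for ad hoc analysis.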


66/301

Problems of Data Warehousing

Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration


67/301

Codd's Rules for OLAP

Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels


68/301

OLAP Tools

Multi-dimensional OLAP (MOLAP)

Multi-dimensional DBMS (MDDBMS)

Relational OLAP (ROLAP)

Creation of multiple multi-dimensional views of the two-dimensional relations

Managed Query Environment (MQE)

Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally


69/301

Data Mining

Definition
  The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions

Knowledge discovery
  Association rules
  Sequential patterns
  Classification trees

Goals
  Prediction
  Identification
  Classification
  Optimization


70/301


Predictive Modeling

Supervised training with two phases
  Training phase: building a model using a large sample of historical data called the training set
  Testing phase: trying the model on new data

Database Segmentation

Link Analysis

Deviation Detection
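
The two phases of predictive modeling can be sketched with a deliberately simple model: a single spend threshold learned from historical records. The data, the 7/3 split, and the threshold model are all assumptions chosen to keep the example short.

```python
# Hypothetical historical records: (monthly spend, responded to a past offer?).
history = [(10, False), (15, False), (22, False), (30, True), (35, True),
           (12, False), (28, True), (40, True), (18, False), (33, True)]

# Hold back the last records so the model is tested on data it has not seen.
training_set, test_set = history[:7], history[7:]

def train(examples):
    """Training phase: pick the spend threshold with the fewest training errors."""
    candidates = sorted(spend for spend, _ in examples)

    def errors(threshold):
        return sum((spend >= threshold) != label for spend, label in examples)

    return min(candidates, key=errors)

def accuracy(threshold, examples):
    """Testing phase: fraction of examples the threshold model predicts correctly."""
    return sum((spend >= threshold) == label for spend, label in examples) / len(examples)

threshold = train(training_set)
test_accuracy = accuracy(threshold, test_set)
print(threshold, test_accuracy)
```

Measuring accuracy on the held-back records, rather than on the training set, is what guards against an over-optimistic estimate.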


71/301

What are Data Mining Tasks?

Classification

Regression

Clustering

Summarization

Dependency modeling

Change and Deviation Detection

71


72/301

What are Data Mining Discoveries?

New Purchase Trends

Plan Investment Strategies

Detect Unauthorized Expenditure

Fraudulent Activities

Crime Trends

Smugglers-border crossing

72


73/301

73

Data Warehouse Architecture

[Figure: source systems (Relational Databases, Legacy Data, Purchased Data, ERP Systems) pass through Extraction and Cleansing and an Optimized Loader into the Data Warehouse Engine, which has a Metadata Repository and serves Analyze and Query requests]


74/301

74

Data Warehouse for Decision Support & OLAP

Putting Information technology to help the

knowledge worker make faster and better

decisions

Which of my customers are most likely to go to the competition?

What product promotions have the biggest

impact on revenue?

How did the share price of software

companies correlate with profits over last 10

years?


75/301

75

Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements

76/301

76

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence


77/301

77

We want to know ...

Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

If I raise the price of my product by Rs. 2, what is the effect on my ROI?

If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

Which of my customers are likely to be the most loyal?

Data Mining helps extract such information


78/301

78

Application Areas

Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value added data
Utilities | Power usage analysis


79/301

79

Data Mining in Use

The US Government uses Data Mining to track fraud

A Supermarket becomes an information broker

Basketball teams use it to track game strategy

Cross Selling

Warranty Claims Routing

Holding on to Good Customers

Weeding out Bad Customers


80/301

80

What makes data mining possible?

Advances in the following areas are making data mining deployable:
  data warehousing
  better and more data (i.e., operational, behavioral, and demographic)
  the emergence of easily deployed data mining tools and
  the advent of new data mining techniques.
-- Gartner Group


81/301

81

Why Separate Data Warehouse?

Performance

Op dbs designed & tuned for known txs & workloads.

Complex OLAP queries would degrade perf. for op txs.

Special data organization, access & implementation

methods needed for multidimensional views & queries.

Function

Missing data: Decision support requires historical data, which op dbs do not typically maintain.

Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.

Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.


82/301

82

What are Operational Systems?

They are OLTP systems

Run mission critical applications

Need to work with stringent performance requirements for routine tasks

Used to run a business!


83/301

83

RDBMS used for OLTP

Database Systems have been used traditionally for OLTP

clerical data processing tasks

detailed, up to date data

structured repetitive tasks

read/update a few records

isolation, recovery and integrity are critical


84/301

84

Operational Systems

Run the business in real time

Based on up-to-the-second data

Optimized to handle large numbers of simple read/write transactions

Optimized for fast response to predefined transactions

Used by people who deal with customers, products -- clerks, salespeople etc.

They are increasingly used by customers


85/301

85

Examples of Operational Data

Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

86/301

86

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity



87/301

87

OLTP vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse

Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
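
The example query constrains three dimensions at once (time of day, city, month) and then aggregates a measure. A minimal Python sketch, using invented call records and an illustrative function name, shows the shape of such a query:

```python
from datetime import datetime
from statistics import mean

# Hypothetical call records: (start time, city, amount charged).
calls = [
    (datetime(2009, 12, 1, 10, 30), "Pune", 12.0),
    (datetime(2009, 12, 3, 16, 45), "Pune", 8.0),
    (datetime(2009, 12, 3, 20, 15), "Pune", 30.0),    # outside 9AM-5PM
    (datetime(2009, 12, 5, 11, 0), "Mumbai", 50.0),   # different city
    (datetime(2009, 11, 28, 9, 5), "Pune", 5.0),      # different month
]

def average_spend(calls, city, month, start_hour, end_hour):
    """Average amount over calls matching the city, month, and hour-of-day window."""
    amounts = [amount for when, where, amount in calls
               if where == city and when.month == month
               and start_hour <= when.hour < end_hour]
    return mean(amounts) if amounts else None

result = average_spend(calls, "Pune", 12, 9, 17)
print(result)
```

On an operational system this scan would compete with transactions, which is one reason such queries are run against a separately organized warehouse.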



88/301

88


OLTP
  Application oriented
  Used to run business
  Detailed data
  Current, up to date
  Isolated data
  Repetitive access
  Clerical user

Warehouse (DSS)
  Subject oriented
  Used to analyze business
  Summarized and refined
  Snapshot data
  Integrated data
  Ad-hoc access
  Knowledge user (manager)



89/301

89


OLTP
  Performance sensitive
  Few records accessed at a time (tens)
  Read/update access
  No data redundancy
  Database size 100 MB - 100 GB

Data Warehouse
  Performance relaxed
  Large volumes accessed at a time (millions)
  Mostly read (batch update)
  Redundancy present
  Database size 100 GB - few terabytes



90/301

90


OLTP
  Transaction throughput is the performance metric
  Thousands of users
  Managed in entirety

Data Warehouse
  Query throughput is the performance metric
  Hundreds of users
  Managed by subsets


91/301

91

To summarize ...

OLTP Systems are used to "run" a business

The Data Warehouse helps to "optimize" the business


92/301

92

Why Now?

Data is being produced

ERP provides clean data

The computing power is available

The computing power is affordable

The competitive pressures are strong

Commercial products are available


93/301

93

Myths surrounding OLAP Servers and Data Marts

Data marts and OLAP servers are departmental

solutions supporting a handful of users

Million dollar massively parallel hardware is

needed to deliver fast time for complex queries

OLAP servers require massive and unwieldy

indices

Complex OLAP queries clog the network with data

Data warehouses must be at least 100 GB to be effective

Source -- Arbor Software Home Page


94/301

II. On-Line Analytical Processing (OLAP)

Making Decision

Support Possible



95/301

95

Typical OLAP Queries

Write a multi-table join to compare sales for each

product line YTD this year vs. last year.

Repeat the above process to find the top 5

product contributors to margin.

Repeat the above process to find the sales of a

product line to new vs. existing customers.

Repeat the above process to find the customers

that have had negative sales growth.



96/301

96

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

What Is OLAP?

Online Analytical Processing - coined by E. F. Codd in a 1994 paper contracted by Arbor Software*

Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System

OLAP = Multidimensional Database

MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)

ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)


97/301

97

The OLAP Market

Rapid growth in the enterprise market
  1995: $700 Million
  1997: $2.1 Billion

Significant consolidation activity among major DBMS vendors
  10/94: Sybase acquires ExpressWay
  7/95: Oracle acquires Express
  11/95: Informix acquires Metacube
  1/97: Arbor partners up with IBM
  10/96: Microsoft acquires Panorama

Result: OLAP shifted from small vertical niche to mainstream DBMS category


98/301

98

Strengths of OLAP

It is a powerful visualization paradigm

It provides fast, interactive response times

It is good for analyzing time series

It can be useful to find some clusters and

outliers

Many vendors offer OLAP tools



99/301

99

Nigel Pendse, Richard Creath - The OLAP Report

OLAP Is FASMI

Fast

Analysis

Shared

Multidimensional

Information


100/301

100

Multi-dimensional Data

"Hey...I sold $100M worth of goods"

Dimensions: Product, Region, Time

Hierarchical summarization paths
  Product: Industry > Category > Product
  Region: Country > Region > City > Office
  Time: Year > Quarter > Month or Week > Day

[Figure: sales cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (N, S, W), and Month (1-7)]


101/301

101

A Visual Operation: Pivot (Rotate)

[Figure: table of Product (Juice, Cola, Milk, Cream) by Date (3/1, 3/2, 3/3, 3/4), showing the values 10, 47, 30, 12, before and after rotation]


102/301

102

Slicing and Dicing

[Figure: cube with dimensions Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Region (India, Far East, Europe), with the Telecomm slice highlighted]
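
Slicing fixes one dimension to a single value; dicing restricts several dimensions to subsets. The sketch below uses the dimension names from the figure, but the cell values and function names are invented for illustration.

```python
# Hypothetical cube cells keyed by (product, sales channel, region).
cube = {
    ("Telecomm", "Retail", "India"): 40,
    ("Telecomm", "Direct", "Europe"): 25,
    ("Video", "Retail", "India"): 10,
    ("Audio", "Special", "Far East"): 5,
    ("Household", "Direct", "Europe"): 15,
}

def slice_cube(cube, product):
    """Slice: fix the product dimension, keeping the other two dimensions."""
    return {(channel, region): value
            for (p, channel, region), value in cube.items() if p == product}

def dice_cube(cube, products, regions):
    """Dice: keep only cells whose product and region fall in the given subsets."""
    return {key: value for key, value in cube.items()
            if key[0] in products and key[2] in regions}

telecomm_slice = slice_cube(cube, "Telecomm")
india_media = dice_cube(cube, {"Telecomm", "Video"}, {"India"})
print(telecomm_slice)
print(india_media)
```

The slice has one dimension fewer than the cube, while the dice keeps all three dimensions over a smaller range.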



103/301

103

Roll-up and Drill Down

Sales Channel

Region

Country

State

Location Address

Sales Representative

Higher Level of Aggregation

Low-level Details
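
Roll-up and drill-down move along a hierarchy such as the one above. As a rough sketch, with a shortened hierarchy (country, state, sales representative) and made-up amounts, rolling up means aggregating to a shorter prefix of the path, and drilling down means asking for a longer one.

```python
from collections import defaultdict

# Hypothetical detail rows at the lowest level: (country, state, sales rep, amount).
detail = [
    ("India", "Maharashtra", "Rep A", 100),
    ("India", "Maharashtra", "Rep B", 50),
    ("India", "Karnataka", "Rep C", 70),
    ("Japan", "Tokyo", "Rep D", 90),
]

def roll_up(rows, level):
    """Aggregate amounts to `level`: 1 = country, 2 = state, 3 = sales representative."""
    totals = defaultdict(int)
    for *path, amount in rows:
        totals[tuple(path[:level])] += amount
    return dict(totals)

by_country = roll_up(detail, 1)   # higher level of aggregation
by_state = roll_up(detail, 2)     # drilling down one level
print(by_country)
print(by_state)
```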


104/301

Results of Data Mining Include:

Forecasting what may happen in the future

Classifying people or things into groups by recognizing patterns

Clustering people or things into groups based on their attributes

Associating what events are likely to occur together

Sequencing what events are likely to lead to later events


105/301

Data mining is not

Brute-force crunching of bulk data
Blind application of algorithms
Going to find relationships where none exist
Presenting data in different ways
A database intensive task
A difficult to understand technology requiring an advanced degree in computer science


106/301

Data Mining versus OLAP

OLAP - On-line Analytical Processing

Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

107/301

Data Mining Versus Statistical Analysis

Data Mining
  Originally developed to act as expert systems to solve problems
  Less interested in the mechanics of the technique
  If it makes sense then let's use it
  Does not require assumptions to be made about data
  Can find patterns in very large amounts of data
  Requires understanding of data and business problem

Data Analysis
  Tests for statistical correctness of models
    Are statistical assumptions of models correct? E.g., is the R-square good?
  Hypothesis testing
    Is the relationship significant? Use a t-test to validate significance
  Tends to rely on sampling
  Techniques are not optimised for large amounts of data
  Requires strong statistical skills

108/301

Examples of What People are Doing with Data Mining:

Fraud/Non-Compliance

Anomaly detection

Isolate the factors that lead to fraud, waste and abuse

Target auditing and investigative efforts more effectively

Credit/Risk Scoring

Intrusion detection

Parts failure prediction

Recruiting/Attracting customers

Maximizing profitability (cross selling, identifying profitable customers)

Service Delivery and Customer Retention

Build profiles of customers likely to use which services

Web Mining


109/301

What data mining has done for...

The US Internal Revenue Service needed to improve customer service and...

Scheduled its workforce to provide faster, more accurate answers to questions.


110/301


The US Drug Enforcement Agency needed to be more effective in their drug busts and...

Analyzed suspects' cell phone usage to focus investigations.


111/301


HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher-yielding investments and...

Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.

112/301

Suggestion: Predicting Washington

C-SPAN has launched a digital archive of 500,000 hours of audio debates.

Text mining or audio mining of these talks could answer certain questions, such as...

Example Application: Sports


113/301

Example Application: Sports

IBM Advanced Scout analyzes NBA game statistics

Shots blocked

Assists

Fouls

Google: IBM Advanced Scout

Advanced Scout


114/301

Advanced Scout

Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that "When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots."

Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%.

Data Mining: Types of Data


115/301

Data Mining: Types of Data

Relational data and transactional data

Spatial and temporal data, spatio-temporal observations

Time-series data

Text

Images, video

Mixtures of data

Sequence data

Features from processing other data sources



116/301


Supervised learning

Classification and regression

Unsupervised learning

Clustering

Dependency modeling

Associations, summarization, causality

Outlier and deviation detection

Trend analysis and change detection

Different Types of Classifiers


117/301

Different Types of Classifiers

Linear discriminant analysis (LDA)

Quadratic discriminant analysis (QDA)

Density estimation methods

Nearest neighbor methods

Logistic regression

Neural networks

Fuzzy set theory

Decision Trees

Test Sample Estimate


118/301

Test Sample Estimate

Divide D into D1 and D2

Use D1 to construct the classifier d

Then use resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d

Unbiased and efficient, but removes D2 from training dataset D

V-fold Cross Validation


119/301

V-fold Cross Validation

Procedure:

Construct classifier d from D

Partition D into V datasets D1, ..., DV

Construct classifier di using D \ Di

Calculate the estimated misclassification error R(di,Di) of di using test sample Di

Final misclassification estimate:

Weighted combination of individual misclassification errors: $R(d,D) = \frac{1}{V} \sum_{i=1}^{V} R(d_i, D_i)$
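The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `v_fold_error` and `majority_trainer` are hypothetical, the folds are built by a simple shuffled round-robin split, and the toy classifier just predicts the most common training label.

```python
import random

def v_fold_error(records, labels, train, v=5, seed=0):
    """Estimate misclassification error by V-fold cross-validation.

    `train(X, y)` must return a function that maps one record to a label.
    """
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]          # partition D into D1..DV
    errors = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]      # D \ Di
        d_i = train([records[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        wrong = sum(d_i(records[i]) != labels[i] for i in fold)
        errors.append(wrong / len(fold))           # R(di, Di)
    return sum(errors) / v                         # (1/V) * sum of R(di, Di)

def majority_trainer(X, y):
    """Toy classifier (assumption): always predict the most common label."""
    most_common = max(set(y), key=y.count)
    return lambda record: most_common
```

Any learner can be plugged in through `train`; the cost noted on the next slide shows up here as V separate calls to it.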

Cross-Validation: Example


120/301

Cross-Validation: Example

d

d1

d2

d3

Cross-Validation


121/301

Cross-Validation

Misclassification estimate obtained through cross-validation is usually nearly unbiased

Costly computation (we need to compute d, and d1, ..., dV); computation of di is nearly as expensive as computation of d

Preferred method to estimate quality of learning algorithms in the machine learning literature

Decision Tree Construction


122/301

Decision Tree Construction

Three algorithmic components:

Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)

Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)

Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

Goodness of a Split


123/301

Goodness of a Split

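Consider node $t$ with impurity $\phi(t)$

The reduction in impurity through splitting predicate $s$ ($t$ splits into children nodes $t_L$ with impurity $\phi(t_L)$ and $t_R$ with impurity $\phi(t_R)$) is:

$\Delta\phi(s,t) = \phi(t) - p_L\,\phi(t_L) - p_R\,\phi(t_R)$

where $p_L$ and $p_R$ are the fractions of the records at $t$ that go to $t_L$ and $t_R$.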
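As a small worked example, the impurity reduction phi(t) - pL*phi(tL) - pR*phi(tR) can be computed as follows. The slide does not fix a particular impurity function; the Gini index is used here as one common choice (an assumption), and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (one possible phi)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right, impurity=gini):
    """phi(t) - pL * phi(tL) - pR * phi(tR) for a binary split of `parent`."""
    p_left = len(left) / len(parent)     # fraction of records sent left
    p_right = len(right) / len(parent)   # fraction of records sent right
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)
```

A split that separates the classes perfectly recovers the full parent impurity, while a split that leaves both children with the same class mix as the parent yields a reduction of zero.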

Pruning Methods


124/301

Pruning Methods

Test dataset pruning

Direct stopping rule

Cost-complexity pruning

MDL pruning

Pruning by randomization testing

Stopping Policies


125/301

Stopping Policies

A stopping policy indicates when further growth of the tree at a node t is counterproductive.

All records are of the same class

The attribute values of all records are identical

All records have missing values

At most one class has a number of records larger than a user-specified number

All records go to the same child node if t

is split (only possible with some split

Test Dataset Pruning


126/301

Test Dataset Pruning

Use an independent test sample D' to estimate the misclassification cost using the resubstitution estimate R(T,D') at each node

Select the subtree T' of T with the smallest expected cost

Missing Values


127/301

Missing Values

What is the problem?

During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)

But if a record r is missing the value of the splitting attribute, r cannot participate further in tree construction

Algorithms for missing values address this problem

Mean and Mode Imputation


128/301

Mean and Mode Imputation

Assume record r has missing value r.X, and splitting variable is X.

Simplest algorithm:

If X is numerical (categorical), impute the overall mean (mode)

Improved algorithm:

If X is numerical (categorical), impute the mean(X|t.C) (the mode(X|t.C))
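Both variants can be sketched with one helper. This is an illustrative sketch only: records are assumed to be dictionaries, missing values are assumed to be stored as `None`, and the function name and the `"C"` class key are hypothetical.

```python
from statistics import mean, mode

def impute_value(records, attr, numerical, cls=None, class_attr="C"):
    """Return a fill-in value for a missing `attr`.

    With cls=None this is the overall mean (or mode); with a class label it
    is the class-conditional mean(X | C) (or mode(X | C)).
    """
    observed = [
        r[attr]
        for r in records
        if r.get(attr) is not None and (cls is None or r[class_attr] == cls)
    ]
    return mean(observed) if numerical else mode(observed)
```

For instance, a record of class "bad" with a missing X would receive the mean of X over the other "bad" records at the node rather than the mean over all records.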

Decision Trees: Summary


129/301

Decision Trees: Summary

Many applications of decision trees

There are many algorithms available for:

Split selection

Pruning

Handling Missing Values

Data Access

Decision tree construction is still an active research area (after 20+ years!)

Challenges: Performance, scalability, evolving datasets, new applications

Supervised vs Unsupervised Learning


130/301

Supervised vs. Unsupervised Learning

Supervised

y = F(x): true function

D: labeled training set

D: {xi, F(xi)}

Learn: G(x): model trained to predict labels D

Goal: $E[(F(x) - G(x))^2] \approx 0$

Well defined criteria: Accuracy, RMSE, ...

Unsupervised

Generator: true model

D: unlabeled data sample

D: {xi}

Learn

??????????

Goal:

??????????

Well defined criteria:

??????????

Clustering: Unsupervised Learning


131/301

Clustering Unsupervised Learning

Given:

Data Set D (training set)

Similarity/distance metric/information

Find:

Partitioning of data

Groups of similar/close items

Similarity?


132/301

Similarity?

Groups of similar customers

Similar demographics

Similar buying behavior

Similar health

Similar products

Similar cost

Similar function

Similar store

Similarity is usually domain/problem specific

133/301

Clustering: Informal Problem Definition

Input:

A data set of N records, each given as a d-dimensional data feature vector.

Output:

Determine a natural, useful partitioning of the data set into a number of (k) clusters and noise such that we have:

High similarity of records within each cluster (intra-cluster similarity)

Low similarity of records between clusters (inter-cluster similarity)

134/301

Types of Clustering

Hard Clustering:

Each object is in one and only one cluster

Soft Clustering:

Each object has a probability of being in each cluster

135/301

Clustering Algorithms

Partitioning-based clustering

K-means clustering

K-medoids clustering

EM (expectation maximization) clustering

Hierarchical clustering

Divisive clustering (top down)

Agglomerative clustering (bottom up)

Density-Based Methods

Regions of dense points separated by sparser regions of relatively low density

136/301

K-Means Clustering Algorithm

Initialize k cluster centers

Do

Assignment step: Assign each data point to its closest cluster center

Re-estimation step: Re-compute cluster centers

While (there are still changes in the cluster centers)

Visualization at:

http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
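The do/while loop above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: points are tuples of numbers, the initial centers are k randomly sampled data points (the slide does not say how to initialize), squared Euclidean distance is used, and the `max_iter` cap is an added safeguard.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # initialize k cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Re-estimation step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                  # no change: stop
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points the loop settles after a few iterations; on harder data the result depends on the starting centers, which is one of the issues raised on the next slide.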

137/301

Issues

Why is K-Means working?

How does it find the cluster centers?

Does it find an optimal clustering?

What are good starting points for the algorithm?

What is the right number of cluster centers?

How do we know it will terminate?

138/301

Agglomerative Clustering

Algorithm:

Put each item in its own cluster (all singletons)

Find all pairwise distances between clusters

Merge the two closest clusters

Repeat until everything is in one cluster

Observations:

Results in a hierarchical clustering

Yields a clustering for each possible number of clusters

Greedy clustering: Result is not necessarily optimal for any given number of clusters
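A short sketch of the merge loop follows. The slide does not say how the distance between two clusters is measured; single link (distance between the closest pair of members) is assumed here, and the function name is hypothetical. The function returns the clustering after every merge, which is the "one clustering per number of clusters" observation in code form.

```python
def agglomerative(points, dist):
    """Single-link agglomerative clustering; returns every intermediate clustering."""
    clusters = [[p] for p in points]                 # all singletons
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the two closest clusters (single link: closest pair of members).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]           # merge the two closest clusters
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history
```

Calling `agglomerative([1, 2, 6, 7, 20], lambda a, b: abs(a - b))` gives five clusterings, from five singletons down to a single cluster.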

139/301

Density-Based Clustering

A cluster is defined as a connected dense component.

Density is defined in terms of number of neighbors of a point.

We can find clusters of arbitrary shape

140/301

Market Basket Analysis

Consider shopping cart filled with several items

Market basket analysis tries to answer the following questions:

Who makes purchases?

What do customers buy together?

In what order do customers purchase items?



141/301

Market Basket Analysis

Given:

A database of customer transactions

Each transaction is a set of items

Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID  CID  Date    Item   Qty
111  201  5/1/99  Pen    2
111  201  5/1/99  Ink    1
111  201  5/1/99  Milk   3
111  201  5/1/99  Juice  6
112  105  6/3/99  Pen    1
112  105  6/3/99  Ink    1
112  105  6/3/99  Milk   1
113  106  6/5/99  Pen    1
113  106  6/5/99  Milk   1
114  201  7/1/99  Pen    2
114  201  7/1/99  Ink    2
114  201  7/1/99  Juice  4

142/301

Market Basket Analysis (Contd.)

Co-occurrences

80% of all customers purchase items X, Y and Z together.

Association rules

60% of all customers who purchase X and Y also buy Z.

Sequential patterns

60% of customers who first buy X also purchase Y within three weeks.

143/301

Confidence and Support

We prune the set of all possible association rules using two interestingness measures:

Confidence of a rule: X → Y has confidence c if P(Y|X) = c

Support of a rule: X → Y has support s if P(XY) = s

We can also define

Support of an itemset (a co-occurrence) XY:
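To make the two measures concrete, the sketch below computes them on the four transactions of the earlier Pen/Ink/Milk/Juice example (TIDs 111 to 114), ignoring quantities. Probabilities are estimated as fractions of transactions; the function names are illustrative.

```python
# Item sets of the four example transactions, keyed by TID.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(itemset <= items for items in transactions.values())
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of the whole rule over support of lhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Ink"}))        # 3 of 4 transactions
print(confidence({"Pen"}, {"Milk"}))  # 3 of the 4 Pen transactions contain Milk
```

Here the rule {Ink} → {Juice} has support 2/4 and confidence 2/3, since Ink appears in three transactions and two of them also contain Juice.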

144/301

Market Basket Analysis: Applications

Sample Applications

Direct marketing

Fraud detection for medical insurance

Floor/shelf planning

Web site layout

Cross-selling

145/301

Applications of Frequent Itemsets

Association Rules

Classification (especially: text, rare classes)

Seeds for construction of Bayesian Networks

Web log analysis

Collaborative filtering

146/301

Association Rule Algorithms

More abstract problem redux

Breadth-first search

Depth-first search

147/301

Problem Redux

Abstract:

A set of items {1, 2, ..., k}

A database of transactions (itemsets) D = {T1, T2, ..., Tn}, Tj subset {1, 2, ..., k}

GOAL:

Find all itemsets that appear in at least x transactions

("appear in" == "are subsets of")

I subset T: T supports I

For an itemset I, the number of transactions it appears in is called the support of I.

Concrete:

I = {milk, bread, cheese, ...}

D = {{milk, bread, cheese}, {bread, cheese, juice}, ...}

GOAL:

Find all itemsets that appear in at least 1000 transactions

{milk, bread, cheese} supports {milk, bread}

148/301

Problem Redux (Contd.)

Definitions:

An itemset is frequent if it is a subset of at least x transactions. (FI.)

An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)

GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

Obvious relationship: MFI subset FI

Example:

D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }

Minimum support x = 3

{1,2} is frequent

{1,2,3} is maximal frequent

Support({1,2}) = 4

All maximal frequent itemsets: {1,2,3}
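The example can be checked with a brute-force sketch that enumerates every candidate itemset. This is deliberately naive (exponential in the number of items) and is not one of the breadth-first or depth-first algorithms mentioned earlier; the function names are illustrative.

```python
from itertools import combinations

def frequent_itemsets(db, min_support):
    """Map each frequent itemset to its support count (brute-force enumeration)."""
    items = sorted(set().union(*db))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in db)   # transactions supporting it
            if count >= min_support:
                frequent[frozenset(candidate)] = count
    return frequent

def maximal(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

db = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
fi = frequent_itemsets(db, min_support=3)
print(fi[frozenset({1, 2})])   # support count of {1,2}
print(maximal(fi))             # the maximal frequent itemsets
```

With x = 3 the sketch finds seven frequent itemsets, all of them subsets of {1,2,3}, which is the only maximal one, in line with MFI being a subset of FI.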

Applications


149/301

Spatial association rules

Web mining

Market basket analysis

User/customer profiling

Extensions: Sequential Patterns


150/301

Suggestions

In the Market Itemset Analysis, replace Milk, Pen, etc. with names of medications and use the idea in a new hospital data mining proposal.

For the swarm intelligence idea, add to it the extra analysis of the induction rules in this set of slides.

Kraft Foods: Direct Marketing


151/301

Kraft Foods: Direct Marketing

Company maintains a large database of purchases by customers.

Data mining

1. Analysts identified associations among groups of products bought by particular segments of customers.

2. Sent out 3 sets of coupons to various households.

Better response rates: 50% increase in sales for one of its products

Continues to use this approach

Health Insurance Commission of Australia: Insurance Fraud

Commission maintains a database of insurance claims, including laboratory tests ordered during the diagnosis of patients.

Data mining

1. Identified the practice of "up coding" to reflect more expensive tests than are necessary.

2. Now monitors orders for lab tests.

Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding."

HNC Software: Credit Card Fraud


152/301

Payment Fraud

Large issuers of cards may lose $10 million / year due to fraud

Difficult to identify the few transactions among thousands which reflect potential fraud

Falcon software

Mines data through neural networks

Introduced in September 1992

Models each cardholder's requested transaction against the customer's past spending history.

Processes several hundred requests per second

Compares current transaction with customer's history

Identifies the transactions most likely to be frauds

Enables bank to stop high-risk transactions before they are authorized

Used by many retail banks: currently monitors 160 million card accounts for fraud

New Account Fraud


153/301

New Account Fraud

Fraudulent applications for credit cards are growing at 50% per year

Falcon Sentry software

Mines data through neural networks and a rule base

Introduced in September 1992

Checks information on applications against data from credit bureaus

Allows card issuers to simultaneously:

increase the proportion of applications received

reduce the proportion of fraudulent applications authorized

154/301

Quality Control

IBM Microelectronics: Quality Control

Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips.

Data mining

1. Built predictive models of

manufacturing yield (% non-defective)

effects of production parameters on chip performance.

2. Discovered critical factors behind

production yield &

product performance.

3. Created a new design for the chip

increased yield: saved millions of dollars in direct manufacturing costs

enhanced product performance by substantially lowering the memory cycle time

Retail Sales


155/301

B & L Stores

Belk and Leggett Stores = one of the largest retail chains

280 stores in southeast U.S.

data warehouse contains 100s of gigabytes (billion characters) of data

data mining to:

increase sales

reduce costs

Selected DSS Agent from MicroStrategy, Inc.

analyze merchandising (patterns of sales)

manage inventory



156/301

DSS Agent

uses intelligent agents data mining

provides multiple functions

recognizes sales patterns among stores

discovers sales patterns by

time of day

day of year

category of product

etc.

swiftly identifies trends & shifts in customer tastes

performs Market Basket Analysis (MBA)

analyzes Point-of-Sale or -Service (POS) data

identifies relationships among products and/or services purchased

E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts.

Agent tool is also used by other Fortune 1000 firms

average ROI > 300 %

Case Based Reasoning

(CBR)


157/301

(CBR)

case A targetcase B

General scheme for a case based reasoning (CBR) model. The target cas

matched against similar precedents in the historical database, such as cas

Case Based Reasoning (CBR)


158/301

Learning through the accumulation of experience

Key issues

Indexing: storing cases for quick, effective access of precedents

Retrieval: accessing the appropriate precedent cases

Advantages

Explicit knowledge form recognizable to humans

No need to re-code knowledge for computer processing

Limitations

Retrieving precedents based on superficial features

E.g. Matching Indonesia with U.S. because both have similar population size

Traditional approach ignores the issue of generalizing knowledge

Genetic Algorithm


159/301

Generation of candidate solutions using the procedures of biological evolution.

Procedure

0. Initialize. Create a population of potential solutions ("organisms").

1. Evaluate. Determine the level of "fitness" for each solution.

2. Cull. Discard the poor solutions.

3. Breed.

a. Select 2 "fit" solutions to serve as parents.

b. From the 2 parents, generate offspring.

* Crossover: Cut the parents at random and switch the 2 halves.

* Mutation: Randomly change the value in a parent solution.

4. Repeat. Go back to Step 1 above.
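The procedure maps onto a short loop. The sketch below is illustrative only: solutions are assumed to be bit lists, the population size, culling fraction (half) and mutation rate are arbitrary choices, the function name is hypothetical, and mutation is applied to the offspring rather than to a parent.

```python
import random

def genetic_search(fitness, length=20, pop_size=30, generations=60,
                   mutation_rate=0.02, seed=0):
    """Evolve bit lists of the given length toward higher `fitness`."""
    rng = random.Random(seed)
    # 0. Initialize: random population of candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # 1. Evaluate and 2. Cull: keep the fitter half.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        # 3. Breed until the population is back to full size.
        children = []
        while len(survivors) + len(children) < pop_size:
            mum, dad = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)               # crossover point
            child = mum[:cut] + dad[cut:]                # crossover
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                     # mutation
            children.append(child)
        population = survivors + children                # 4. Repeat
    return max(population, key=fitness)
```

For example, `genetic_search(sum)` searches for the bit list with the most ones; because the fitter half always survives, the best fitness in the population never decreases from one generation to the next.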

Genetic Algorithm (Cont.)


160/301

Advantages

Applicable to a wide range of problem domains.

Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy.

Implicit parallelism: can search in many directions concurrently.

Limitations

Slow, like neural networks.

But: computation can be distributed over multiple processors (unlike neural networks)

Source: www.pathology.washington.edu

Multistrategy Learning


161/301

Every technique has advantages & limitations

Multistrategy approach

Take advantage of the strengths of diverse techniques

Circumvent the limitations of each methodology

Types of Models


162/301

Prediction Models for Predicting and Classifying

Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)

Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)

Descriptive Models for Grouping and Finding Associations

Clustering/Grouping algorithms: K-means, Kohonen

Association algorithms: apriori, GRI


163/301

Neural Networks

Description

Difficult interpretation

Tends to overfit the data

Extensive amount of training time

A lot of data preparation

Works with all data types



164/301

Rule Induction

Description

Intuitive output

Handles all forms of numeric data, as well as non-numeric (symbolic) data

C5 Algorithm: a special case of rule induction

165/301

Apriori

Description

Seeks association rules in dataset

Market basket analysis

Sequence discovery

Data Mining Is


166/301

The automated process of finding relationships and patterns in stored data

It is different from the use of SQL queries and other business intelligence tools

Data Mining Is


167/301

Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities

Enabled by data warehousing, parallel processing, and data mining algorithms

168/301

Common Types of Information from Data Mining

Associations -- identifies occurrences that are linked to a single event

Sequences -- identifies events that are linked over time

Classification -- recognizes patterns that describe the group to which an item belongs

169/301

Common Types of Information from Data Mining

Clustering -- discovers different groupings within the data

Forecasting -- estimates future values

170/301

Commonly Used Data Mining Techniques

Artificial neural networks

Decision trees

Genetic algorithms

Nearest neighbor method

Rule induction

171/301

The Current State of Data Mining Tools

Many of the vendors are small companies

IBM and SAS have been in the market for some time, and more "biggies" are moving into this market

BI tools and RDBMS products are increasingly including basic data mining capabilities

Packaged data mining applications are becoming common

The Data Mining Process


172/301

Requires personnel with domain, data warehousing, and data mining expertise

Requires data selection, data extraction, data cleansing, and data transformation

Most data mining tools work with highly granular flat files

Is an iterative and interactive process

Why Data Mining


173/301

Credit ratings/targeted marketing:

Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Identify likely responders to sales promotions

Fraud detection

Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management:

Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information

Applications


174/301

Banking: loan/credit card approval

predict good customers based on old customers

Customer relationship management:

identify those who are likely to leave for a competitor.

Targeted marketing:

identify likely responders to promotions

Fraud detection: telecommunications, financial transactions

from an online stream of events, identify fraudulent events

Manufacturing and production:

automatically adjust knobs when process parameters change

Applications (continued)


175/301

Medicine: disease outcome, effectiveness of treatments

analyze patient disease history: find relationship between diseases

Molecular/Pharmaceutical: identify new drugs

Scientific data analysis:

identify new galaxies by searching for subclusters

Web site/store design and promotion:

find affinity of visitor to pages and modify

The KDD process


176/301

Problem formulation

Data collection

subset data: sampling might hurt if highly skewed data

feature selection: principal component analysis, heuristic search

Pre-processing: cleaning

name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values

Transformation:

map complex objects e.g. time series data to features e.g. frequency

Choosing mining task and mining method:

Result evaluation and Visualization:

Knowledge discovery is an iterative process

Relationship with other fields


177/301

Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on

scalability of number of features and instances

stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning.

automation for handling large, heterogeneous data

Some basic operations


178/301

Predictive:Regression

Classification

Collaborative Filtering

Descriptive:

Clustering / similarity matching

Association rules and variants

Deviation detection

Classification


179/301

Given old data about customers and payments, predict new applicant's loan eligibility.

[Figure: previous customers' attributes (Age, Salary, Profession, Location, Customer type) feed a Classifier, which yields decision rules (e.g. Salary > 5 L, Prof. = Exec); the rules label new applicants' data as Good/bad.]

Classification methods


180/301

Goal: Predict class Ci = f(x1, x2, ..., xn)

Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci.

Nearest neighbour

Decision tree classifier: divide decision space into piecewise constant regions.

Probabilistic/generative models

Neural networks: partition by non-

Nearest neighbor


181/301

Define proximity between instances, find neighbors of new instance and assign majority class

Case based reasoning: when attributes are more complicated than real-valued.

Cons

Slow during application.

No feature selection.

Notion of proximity vague

Pros

+ Fast training
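A minimal sketch of the nearest neighbor rule, assuming real-valued attributes and Euclidean distance as the proximity measure (both assumptions; the function name and the toy data are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Assign `query` the majority class among its k nearest training instances.

    `train` is a list of (feature_tuple, label) pairs.
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "good"), ((1, 2), "good"), ((2, 1), "good"),
         ((8, 8), "bad"), ((9, 8), "bad")]
print(knn_classify(train, (1.5, 1.5)))
```

"Training" is just storing the instances, which is why it is fast; every prediction scans and sorts the whole training set, which is the "slow during application" point above.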

Clustering


182/301

Unsupervised learning when old data with class labels not available e.g. when introducing a new product.

Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.

Key requirement: Need a good measure of similarity between instances.

Identify micro-markets and develop policies for each

Applications


183/301

Customer segmentation e.g. for targeted marketing

Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.

Identify micro-markets and develop policies for each

Collaborative filtering:

group based on common items purchased

Text clustering

Compression

Distance functions


184/301

Numeric data: Euclidean, Manhattan distances

Categorical data: 0/1 to indicate presence/absence followed by

Hamming distance (# dissimilarity)

Jaccard coefficients: # similarity in 1s / (# of 1s)

data dependent measures: similarity of A and B depends on co-occurrence with C.

Combined numeric and categorical data:

weighted normalized distance:
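The numeric and categorical measures above can be written out directly. In this sketch the Jaccard coefficient is read as "positions where both vectors are 1, divided by positions where at least one is 1", which is one reasonable interpretation of the slide's shorthand; the function names are illustrative.

```python
def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0
```

For the presence/absence vectors (1,0,1,1,0) and (1,1,0,1,0), the Hamming distance is 2 and the Jaccard coefficient is 2/4.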

Clustering methods


185/301

Hierarchical clustering

agglomerative vs. divisive

single link vs. complete link

Partitional clustering

distance-based: K-means

model-based: EM

density-based:

Partitional methods: K-means


186/301

Criteria: minimize sum of square of distance

Between each point and centroid of the cluster.

Between each pair of points in the cluster

Algorithm:

Select initial partition with K clusters: random, first K, K separated points

Repeat until stabilization:

Assign each point to closest cluster center

Collaborative Filtering


187/301

Given database of user preferences, predict preference of new user

Example: predict what new movies you will like based on

your past preferences

others with similar past preferences

their preferences for the new movies

Example: predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer)
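The movie example can be sketched as a similarity-weighted average. Everything specific here is an assumption made for illustration: ratings are stored as nested dictionaries, similarity between two users is taken as 1 / (1 + mean absolute rating difference on shared items), and the user names and ratings are made up.

```python
def predict_rating(ratings, user, item):
    """Predict `user`'s rating of `item` from users with similar past ratings."""
    def similarity(u, v):
        shared = set(ratings[u]) & set(ratings[v])
        if not shared:
            return 0.0
        diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in shared) / len(shared)
        return 1.0 / (1.0 + diff)        # 1.0 means identical past preferences

    weighted_sum = total_weight = 0.0
    for other, theirs in ratings.items():
        if other != user and item in theirs:
            w = similarity(user, other)
            weighted_sum += w * theirs[item]   # their preference for the new item
            total_weight += w
    return weighted_sum / total_weight if total_weight else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 5},
    "cat": {"A": 1, "B": 5, "C": 1},
}
print(predict_rating(ratings, "ann", "C"))
```

Because ann's past ratings match bob's exactly and differ sharply from cat's, the prediction for movie C lands close to bob's rating.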

188/301

Association rules

Given set T of groups of items

Example: set of item sets purchased

Goal: find all rules on itemsets of the form a --> b such that

support of a and b > user threshold s

conditional probability (confidence) of b given a > user threshold c

Example: Milk --> bread

Purchase of product A --> ...

T:

Milk, cereal

Tea, milk

Tea, rice, bread

cereal

189/301

Prevalent ≠ Interesting

Analysts already know about prevalent rules

Interesting rules are those that deviate from prior expectation

Mining's payoff is in finding surprising phenomena

1995: Milk and cereal sell together!

1998: Zzzz... Milk and cereal sell together!

190/301

Applications of fast itemset counting

Find correlated events:

Applications in medicine: find redundant tests

Cross selling in retail, banking

Improve predictive capability of classifiers that assume attribute independence

New similarity measures of categorical attributes [Mannila et al,

Application Areas


191/301

Industry                  Application
Finance                   Credit Card Analysis
Insurance                 Claims, Fraud Analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data Service providers    Value added data
Utilities                 Power usage analysis

Usage scenarios


192/301

Data warehouse mining:

assimilate data from operational sources

mine static data

Mining log data

Continuous mining: example in process control

Stages in mining:

data selection -> pre-processing: cleaning -> transformation -> mining -> result evaluation -> visualization

Mining market


193/301

Around 20 to 30 mining tool vendors

Major tool players:

Clementine,

IBM's Intelligent Miner,

SGI's MineSet,

SAS's Enterprise Miner.

All pretty much the same set of tools

Many embedded products:

fraud detection:

electronic commerce applications,

health care,

customer relationship management: Epiphany

Vertical integration: Mining on the web


194/301

Web log analysis for site design:

what are popular pages,

what links are hard to find.

Electronic stores sales enhancements:

recommendations, advertisement:

Collaborative filtering: Net perception, Wisewire

Inventory control: what was a shopper looking for and could not find..

State of art in mining OLAP integration


195/301

Decision trees [Information discovery, Cognos]

find factors influencing high profits

Clustering [Pilot software]

segment customers to define hierarchy on that dimension

Time series analysis: [Seagate's Holos]

Query for various shapes along time: e.g. spikes, outliers

Multi-level Associations [Han et al.]

find association between members of dimensions

Data Mining in Use


196/301

The US Government uses Data Mining to track fraud

A Supermarket becomes an information broker

Basketball teams use it to track game strategy

Cross Selling

Target Marketing

Holding on to Good Customers

Weeding out Bad Customers

Some success stories


197/301

Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data

Won over (manual) knowledge engineering approach

http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process

Major US bank: customer attrition prediction

First segment customers based on financial behavior: found 3 segments

Build attrition models for each of the 3 segments

40-50% of attritions were predicted == factor of 18 increase

Targeted credit marketing: major US banks

find customer segments based on 13 months credit balances

What is KnowledgeSeeker?


198/301

Data Mining 199

Produced by ANGOSS Software Corporation, which focuses solely on data mining software.

Offers training and consulting services

Produces data mining add-ins which accept data from all major databases

Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.

Major Competitors


199/301

Data Mining 200

Company Software

Clementine 6.0

Enterprise Miner 3.0

Intelligent Miner

Major Competitors


200/301

Data Mining 201

Company Software

Mineset 3.1

Darwin

Scenario

Current Applications


201/301

Data Mining 202

Manufacturing

Used by the R.R. Donnelley & Sons commercial printing company to improve process control, cut costs and increase productivity.

Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.



202/301

Data Mining 203

Auditing

Used by the IRS to combat fraud, reduce risk, and increase collection rates.

Finance

Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.


CRM


203/301

Data Mining 204

CRM

Telephony

Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.


Marketing


204/301

Data Mining 205

Marketing

Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.

Health Care

Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.

Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.

More Customers


205/301

Data Mining 206

206/301

Questions

1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker that has low to moderate salt consumption?

2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?

3. If you are a 2% milk drinker, how many factors are still interesting?

4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?

5. Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.

Association


Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction.

This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products.

Example: 80 percent of all transactions in which beer was purchased also included potato chips.
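A figure such as "80 percent of beer transactions also included potato chips" is the confidence of the rule beer -> potato chips. The sketch below shows one way to compute it. The baskets are invented for illustration, and the function name rule_stats is hypothetical rather than part of any tool named in these slides.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions that contain every antecedent item.
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain antecedent and consequent together.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "potato chips"},
    {"beer", "potato chips", "milk"},
    {"beer", "potato chips", "bread"},
    {"beer", "potato chips", "salsa"},
    {"beer", "milk"},
    {"milk", "bread"},
    {"bread", "potato chips"},
    {"milk"},
    {"bread"},
    {"milk", "potato chips"},
]

support, confidence = rule_stats(baskets, {"beer"}, {"potato chips"})
print(f"support={support:.0%} confidence={confidence:.0%}")
```

Support says how often the combination occurs at all, while confidence says how often the consequent appears given the antecedent. A rule usually needs both to be high enough before it is worth acting on.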

Sequence-based analysis


Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction.

Sequence-based analysis instead follows purchases over time to identify a typical set of purchases that might predict the subsequent purchase of a specific item.
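As a rough illustration of the sequence idea, the sketch below takes each customer's purchases in time order and measures how often a given set of earlier purchases is followed by a later purchase of a target item. The purchase histories, the item names and the function name sequence_confidence are all assumptions made for this example.

```python
def sequence_confidence(histories, earlier_items, later_item):
    """Share of customers who, once they have bought all earlier_items,
    go on to buy later_item in a subsequent transaction."""
    earlier_items = set(earlier_items)
    had_earlier = 0
    followed = 0
    for history in histories:
        seen = set()
        matched_at = None
        # Find the first transaction by which all earlier items were bought.
        for position, basket in enumerate(history):
            seen |= set(basket)
            if earlier_items <= seen:
                matched_at = position
                break
        if matched_at is None:
            continue
        had_earlier += 1
        # Only transactions strictly after that point count as "subsequent".
        if any(later_item in basket for basket in history[matched_at + 1:]):
            followed += 1
    return followed / had_earlier if had_earlier else 0.0


# Hypothetical purchase histories: one list of baskets per customer, oldest first.
histories = [
    [{"tent"}, {"sleeping bag"}, {"camp stove"}],
    [{"tent", "sleeping bag"}, {"lantern"}],
    [{"camp stove"}, {"tent"}, {"sleeping bag"}],
    [{"tent"}, {"lantern"}],
    [{"sleeping bag"}, {"tent"}, {"camp stove", "lantern"}],
]

ratio = sequence_confidence(histories, {"tent", "sleeping bag"}, "camp stove")
print(f"{ratio:.0%} of matching customers later bought a camp stove")
```

The difference from plain market-basket analysis is the ordering: a camp stove bought before the tent and sleeping bag does not count as a subsequent purchase.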

Clustering


Clustering approaches address segmentation problems.

These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."

Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
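One common way to form such segments is k-means, which is used here only as an example; the slides do not name a specific clustering algorithm. The sketch below is a minimal version in plain Python. The customer data (store visits per month, average spend per visit) and the starting centroids are invented, and a real analysis would normally rescale the attributes first so that one of them does not dominate the distance.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, initial_centroids, iterations=10):
    """Assign each point to one of len(initial_centroids) segments."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point, centroids)].append(point)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(point, centroids) for point in points]
    return labels, centroids


# Hypothetical customers: (visits per month, average spend per visit).
customers = [(2, 20), (3, 25), (2, 30), (10, 200), (12, 220), (11, 210)]

labels, centroids = kmeans(customers, initial_centroids=[customers[0], customers[3]])
print(labels)
print(centroids)
```

Each label is a segment number, so the result can be joined back to the customer records to compare buying habits across segments.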

Classification


Most commonly applied data mining technique.

Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.

Example: A classifier derived from the classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
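To make "uses preclassified examples to determine the set of parameters" concrete, the sketch below learns a single parameter, a threshold on one attribute, from labelled loan examples. It is a deliberately tiny classifier (a one-attribute decision stump), chosen for illustration only. The attribute (debt-to-income ratio), the example loans and the function names are assumptions, and the sketch assumes that higher values mean higher risk.

```python
def train_stump(examples):
    """Pick the threshold that misclassifies the fewest labelled examples.

    examples is a list of (value, risky) pairs, where risky is True or False.
    Returns None when there are fewer than two distinct values to split on.
    """
    values = sorted({value for value, _ in examples})
    # Candidate thresholds lie halfway between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in candidates:
        errors = sum((value > threshold) != risky for value, risky in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def is_risky(value, threshold):
    """Classify a new loan application using the learned threshold."""
    return value > threshold


# Hypothetical preclassified loans: (debt-to-income ratio, turned out risky?).
past_loans = [
    (0.10, False), (0.20, False), (0.25, False), (0.30, False),
    (0.40, True), (0.50, True), (0.55, True), (0.60, True),
]

threshold = train_stump(past_loans)
print(threshold, is_risky(0.52, threshold), is_risky(0.18, threshold))
```

Real classifiers such as decision trees or neural networks fit many more parameters, but the workflow is the same: learn from examples whose class is already known, then apply the result to new cases.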

Issues of Data Mining


Present-day tools are strong but require significant expertise to implement effectively.

Susceptibility to "dirty" or irrelevant data.

Inability to "explain" results in human terms.

Issues


Susceptibility to "dirty" or irrelevant data

Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions.

Users must take the necessary precautions to ensure that the data being analyzed is "clean."
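The kind of precaution meant here can be as simple as screening records before they reach the mining tool. The sketch below separates records that are duplicated, have missing values, or fall outside a plausible range. The field names, ranges and records are invented for illustration, and the checks assume that range-checked fields are numeric and also listed as required.

```python
def split_clean_and_dirty(records, required_fields, valid_ranges):
    """Return (clean, dirty), where dirty holds (record, reason) pairs."""
    clean, dirty, seen = [], [], set()
    for record in records:
        # An exact repeat of an earlier record is treated as a duplicate.
        key = tuple(sorted(record.items()))
        if key in seen:
            dirty.append((record, "duplicate"))
            continue
        seen.add(key)
        if any(record.get(field) in (None, "") for field in required_fields):
            dirty.append((record, "missing value"))
            continue
        if any(not (low <= record[field] <= high)
               for field, (low, high) in valid_ranges.items()):
            dirty.append((record, "out of range"))
            continue
        clean.append(record)
    return clean, dirty


# Hypothetical customer records.
records = [
    {"customer": "A", "age": 34, "income": 52000},
    {"customer": "B", "age": None, "income": 41000},
    {"customer": "C", "age": 212, "income": 38000},
    {"customer": "A", "age": 34, "income": 52000},
    {"customer": "D", "age": 58, "income": 67000},
]

clean, dirty = split_clean_and_dirty(
    records,
    required_fields=["age", "income"],
    valid_ranges={"age": (0, 120), "income": (0, 10_000_000)},
)
print(len(clean), [reason for _, reason in dirty])
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it possible to review whether the rules themselves are sensible.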

Issues, cont


Inability to "explain" results in human terms

Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.

What good does the information do if you don't understand it?

Comparison with reporting, BI and OLAP

Reporting

Simple relationships

Choose the relevant factors

Examine all details

(Also applies to visualisation & simple statistics)

Data Mining

Complex relationships

Automatically find the relevant factors

Show only relevant details

Prediction

Comparison with Statistics


Statistical analysis

Mainly about hypothesis testing

Focussed on precision

Data mining

Mainly about hypothesis generation

Focussed on deployment

Example: data mining and customer processes

Insight: Who are my customers and why do they behave the way they do?

Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?

Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites

Example: data mining and fraud detection

Insight: How can (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?

Prediction: Recognise similarity to previous frauds (how similar?)

Spot abnormal events (how suspicious?)
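The "how suspicious?" question implies a score rather than a yes/no answer. One simple way to produce such a score, shown here purely as an assumption and not as the method used by any tool in these slides, is to measure how many standard deviations a new event lies from past normal events. The amounts and the cut-off of 3.0 are invented for illustration.

```python
from statistics import mean, stdev


def suspicion_scores(normal_amounts, new_amounts):
    """Score each new amount by its distance from normal behaviour,
    measured in standard deviations of the past normal amounts."""
    if len(normal_amounts) < 2:
        raise ValueError("need at least two past events")
    centre = mean(normal_amounts)
    spread = stdev(normal_amounts)
    if spread == 0:
        raise ValueError("past events show no variation")
    return [abs(amount - centre) / spread for amount in new_amounts]


# Hypothetical past card transactions that are known to be normal.
past = [20, 25, 22, 30, 28, 24, 26, 25]

scores = suspicion_scores(past, [27, 400])
# Treat anything more than 3 standard deviations away as suspicious.
flagged = [score > 3.0 for score in scores]
print(scores, flagged)
```

A score like this only captures abnormality in one attribute. Recognising similarity to previous frauds would need labelled fraud cases and a classifier instead.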

Example: data mining and diagnosing cancer

Complex data from genetics

Challenging data mining problem

Find patterns of gene activation indicating different diseases / stages

"Changed the way I think about cancer" (oncologist from Chicago Children's Memorial Hospital)

Example: data mining and policing


Knowing the patterns helps plan effective crime prevention

Crime hot-spots understood better

Sift through mountains of crime reports

Identify crime series

"Other people save money using data mining; we save lives." (police force homicide specialist and data miner)

Data mining tools: Clementine and its philosophy

How to do data mining


Lots of data mining operations

How do you glue them together to solve a problem?

How do we actually do data mining?

Methodology

Not just the right way, but any way

Myths about Data Mining (1): Data, Process and Tech

Myth: Data mining is all about massive data

Reality: It can be, but some important datasets are very small, and sampling is often appropriate

Myth: Data mining is a technical process

Reality: Business analysts perform data mining every day. It is a business process

Myth: Data mining is all about algorithms

Reality: Algorithms are a key tool. But data mining is done by people, not by algorithms

Myth: Data mining is all about predictive accuracy

Reality: It's about usefulness. Accuracy is only a small component

Myths about Data Mining (2): Data Quality

Myth: Data mining only works with clean data

Reality: Cleaning the data is part of the data mining process. Need not be clean initially

Myth: Data mining only works with complete data

Reality: Data mining works with whatever data you have. Complete is good, incomplete is also ok.

Myth: Data mining only works with correct
