very good minng

Upload: gaurav-singh

Post on 14-Apr-2018


TRANSCRIPT

  • 7/30/2019 Very Good Minng

    1/301

    Data Mining Tools

    Overview & Tutorial

    Ahmed Sameh

    Prince Sultan University

    Department of Computer Science & Information Systems

    May 2010 (Some slides belong to IBM)

    1

  • 7/30/2019 Very Good Minng

    2/301

    2

    Introduction Outline

    Define data mining

    Data mining vs. databases

    Basic data mining tasks

    Data mining development

    Data mining issues

    Goal: Provide an overview of data mining.

  • 7/30/2019 Very Good Minng

    3/301

    3

    Introduction

    Data is growing at a phenomenal rate

    Users expect more sophisticated information

    How?

    UNCOVER HIDDEN INFORMATION

    DATA MINING

  • 7/30/2019 Very Good Minng

    4/301

    4

    Data Mining Definition

    Finding hidden information in a database

    Fit data to a model

    Similar terms

    Exploratory data analysis

    Data driven discovery

    Deductive learning

  • 7/30/2019 Very Good Minng

    5/301

    5

    Data Mining Algorithm

    Objective: Fit Data to a Model

    Descriptive

    Predictive

    Preference: technique to choose the best model

    Search: technique to search the data

    Query

  • 7/30/2019 Very Good Minng

    6/301

    6

    Database Processing vs. Data Mining Processing

    Database processing
      Query: well defined; SQL
      Data: operational data
      Output: precise; subset of database

    Data mining processing
      Query: poorly defined; no precise query language
      Data: not operational data
      Output: fuzzy; not a subset of database

  • 7/30/2019 Very Good Minng

    7/301

    7

    Query Examples

    Database
      Find all credit applicants with last name of Smith.
      Identify customers who have purchased more than $10,000 in the last month.
      Find all customers who have purchased milk.

    Data Mining
      Find all credit applicants who are poor credit risks. (classification)
      Identify customers with similar buying habits. (clustering)
      Find all items which are frequently purchased with milk. (association rules)
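    The milk examples above can be contrasted in a few lines of Python. This is only a sketch: the basket data, the function names, and the 50% support threshold are all assumptions made for illustration. The first function answers a precise question and returns a subset of the stored data; the second derives a pattern that is not stored anywhere.

```python
from collections import Counter

# Hypothetical purchase records: one set of items per customer basket.
baskets = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"bread", "jam"},
    "dave": {"milk", "cereal", "bread"},
}

def customers_who_bought(item):
    """Database-style query: precise condition, result is a subset of the data."""
    return sorted(name for name, items in baskets.items() if item in items)

def frequently_bought_with(item, min_support=0.5):
    """Mining-style query: derive co-purchase patterns from the baskets."""
    with_item = [items for items in baskets.values() if item in items]
    counts = Counter(other for items in with_item for other in items if other != item)
    # Keep items present in at least min_support of the baskets containing `item`.
    return {other: n / len(with_item) for other, n in counts.items()
            if n / len(with_item) >= min_support}

milk_buyers = customers_who_bought("milk")
milk_partners = frequently_bought_with("milk")
```

    With this toy data, bread is the only item that accompanies milk often enough to pass the threshold.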

  • 7/30/2019 Very Good Minng

    8/301

    8

    Related Fields

    Statistics

    Machine Learning

    Databases

    Visualization

    Data Mining and Knowledge Discovery

  • 7/30/2019 Very Good Minng

    9/301

    9

    Statistics, Machine Learning and Data Mining

    Statistics:
      more theory-based
      more focused on testing hypotheses

    Machine learning:
      more heuristic
      focused on improving performance of a learning agent
      also looks at real-time learning and robotics, areas not part of data mining

    Data Mining and Knowledge Discovery:
      integrates theory and heuristics
      focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results

    Distinctions are fuzzy

  • 7/30/2019 Very Good Minng

    10/301

    Definition

    A class of database applications that analyze data in a database using tools which look for trends or anomalies.

    Data mining was invented by IBM.

  • 7/30/2019 Very Good Minng

    11/301

    Purpose

    To look for hidden patterns or previously

    unknown relationships among the data in a

    group of data that can be used to predict future

    behavior.

    Ex: Data mining software can help retail

    companies find customers with common

    interests.

  • 7/30/2019 Very Good Minng

    12/301

    Background Information

    Many of the techniques used by today's data

    mining tools have been around for many years,

    having originated in the artificial intelligence

    research of the 1980s and early 1990s.

    Data Mining tools are only now being applied

    to large-scale database systems.

  • 7/30/2019 Very Good Minng

    13/301

    The Need for Data Mining

    The amount of raw data stored in corporate

    data warehouses is growing rapidly.

    There is too much data, and too much complexity, that might be relevant to a specific problem.

    Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.

  • 7/30/2019 Very Good Minng

    14/301

    The Need for Data Mining, cont

    The need for information has resulted in the

    proliferation of data warehouses that integrate

    information from multiple sources to support

    decision making.

    Often include data from external sources, such

    as customer demographics and household

    information.

  • 7/30/2019 Very Good Minng

    15/301

    Definition (Cont.)

    Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

    Valid: The patterns hold in general.

    Novel: We did not know the pattern beforehand.

    Useful: We can devise actions from the patterns.

    Understandable: We can interpret and comprehend the patterns.

  • 7/30/2019 Very Good Minng

    16/301

    Of Laws, Monsters, and Giants

    Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)

    Its more aggressive cousin: disk storage capacity doubles every 9 months

    [Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 with an ExaByte marker. Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), www.disktrend.com]

    What do the two laws combined produce?

    A rapidly growing gap between our ability to generate data, and our ability
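    The growth figures on this slide can be cross-checked with a little arithmetic. The sketch below converts between a doubling period and a yearly growth rate; the function names are illustrative and do not come from the slides. Doubling every 18 months works out to about 58.7% per year, which matches the Moore's Law label. A rate of 112% per year corresponds to doubling roughly every 11 months, which is close to, but not exactly, the nine months quoted.

```python
import math

def annual_growth(doubling_months):
    """Yearly growth rate implied by a doubling period given in months."""
    return 2 ** (12 / doubling_months) - 1

def doubling_months(annual_rate):
    """Doubling period in months implied by a yearly growth rate (1.12 means 112%/y)."""
    return 12 * math.log(2) / math.log(1 + annual_rate)

moore_rate = annual_growth(18)        # roughly 0.587
disk_doubling = doubling_months(1.12) # roughly 11 months
```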

  • 7/30/2019 Very Good Minng

    17/301

    What is Data Mining?

    Finding interesting structure in data

    Structure: refers to statistical patterns, predictive models, hidden relationships

    Examples of tasks addressed by Data Mining

    Predictive Modeling (classification, regression)

    Segmentation (Data Clustering)

    Summarization

  • 7/30/2019 Very Good Minng

    18/301

  • 7/30/2019 Very Good Minng

    19/301

    19

    Major Application Areas for Data Mining Solutions

    Advertising
    Bioinformatics
    Customer Relationship Management (CRM)
    Database Marketing
    Fraud Detection
    eCommerce
    Health Care
    Investment/Securities
    Manufacturing, Process Control
    Sports and Entertainment
    Telecommunications
    Web

  • 7/30/2019 Very Good Minng

    20/301

    20

    Data Mining

    The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.

    Extremely large datasets

    Discovery of the non-obvious

    Useful knowledge that can improve processes

    Cannot be done manually

    Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.

    Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.

  • 7/30/2019 Very Good Minng

    21/301

    21

    Data Mining (cont.)

  • 7/30/2019 Very Good Minng

    22/301

    22

    Data Mining (cont.)

    Data Mining is a step of the Knowledge Discovery in Databases (KDD) process:

    Data Warehousing

    Data Selection

    Data Preprocessing

    Data Transformation

    Data Mining

    Interpretation/Evaluation

    Data Mining is sometimes referred to as KDD, and DM and KDD tend to be used as synonyms

  • 7/30/2019 Very Good Minng

    23/301

    23

    Data Mining Evaluation

  • 7/30/2019 Very Good Minng

    24/301

    24

    Data Mining is Not

    Data warehousing

    SQL / Ad Hoc Queries / Reporting

    Software Agents

    Online Analytical Processing (OLAP)

    Data Visualization

  • 7/30/2019 Very Good Minng

    25/301

    25

    Data Mining Motivation

    Changes in the Business Environment

    Customers becoming more demanding

    Markets are saturated

    Databases today are huge:

    More than 1,000,000 entities/records/rows

    From 10 to 10,000 fields/attributes/variables

    Gigabytes and terabytes

    Databases are growing at an unprecedented rate

    Decisions must be made rapidly

    Decisions must be made with maximum knowledge

  • 7/30/2019 Very Good Minng

    26/301

    Why Use Data Mining Today?

    Human analysis skills are inadequate:

    Volume and dimensionality of the data

    High data growth rate

    Availability of:

    Data

    Storage

    Computational power

    Off-the-shelf software

    Expertise

  • 7/30/2019 Very Good Minng

    27/301

    An Abundance of Data

    Supermarket scanners, POS data

    Preferred customer cards

    Credit card transactions

    Direct mail response

    Call center records

    ATM machines

    Demographic data

    Sensor networks

    Cameras

    Web server logs

    Customer web site trails

  • 7/30/2019 Very Good Minng

    28/301

    Evolution of Database Technology

    1960s: IMS, network model

    1970s: The relational data model, first relational DBMS implementations

    1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS

    1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology

    2000s: High availability, zero-administration, seamless integration into business processes

    2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???

  • 7/30/2019 Very Good Minng

    29/301

    Much Commercial Support

    Many data mining tools

    http://www.kdnuggets.com/software

    Database systems with data mining support

    Visualization tools

    Data mining process support

    Consultants

  • 7/30/2019 Very Good Minng

    30/301

    Why Use Data Mining Today?

    Competitive pressure!

    "The secret of success is to know something that nobody else knows."
    -- Aristotle Onassis

    Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)

    Personalization, CRM

    The real-time enterprise

    Systemic listening

    Security, homeland defense

  • 7/30/2019 Very Good Minng

    31/301

    The Knowledge Discovery Process

    Steps:

    1. Identify business problem

    2. Data mining

    3. Action

    4. Evaluation and measurement

    5. Deployment and integration into business processes

  • 7/30/2019 Very Good Minng

    32/301

    Data Mining Step in Detail

    2.1 Data preprocessing

      Data selection: identify target datasets and relevant fields

      Data cleaning: remove noise and outliers

      Data transformation: create common units, generate new fields

    2.2 Data mining model construction

    2.3 Model evaluation
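    The preprocessing sub-steps in 2.1 can be sketched as small functions chained together. Everything concrete here is an assumption for illustration: the record layout, the fixed exchange rate, the "ten times the median" outlier rule, and the spending bands. In this toy data the unit conversion has to run before outlier removal, because amounts in different currencies cannot be compared directly.

```python
from statistics import median

# Hypothetical raw records: (customer, amount, currency).
raw = [
    ("c1", 120.0, "USD"),
    ("c2", None, "USD"),      # missing value
    ("c3", 9000.0, "INR"),
    ("c4", 95.0, "USD"),
    ("c5", 50000.0, "USD"),   # suspicious outlier
]
INR_PER_USD = 75.0  # illustrative constant, not real market data

def select(records):
    """Data selection: keep only records that have the field we need."""
    return [r for r in records if r[1] is not None]

def to_common_units(records):
    """Data transformation: express every amount in USD."""
    return [(c, a / INR_PER_USD if cur == "INR" else a) for c, a, cur in records]

def remove_outliers(records, factor=10.0):
    """Data cleaning: drop amounts more than `factor` times the median."""
    m = median(a for _, a in records)
    return [(c, a) for c, a in records if a <= factor * m]

def add_fields(records):
    """Data transformation: generate a new field, a coarse spending band."""
    return [(c, a, "high" if a >= 100 else "low") for c, a in records]

prepared = add_fields(remove_outliers(to_common_units(select(raw))))
```

    The resulting `prepared` list is what a model construction step (2.2) would consume.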

  • 7/30/2019 Very Good Minng

    33/301

    Preprocessing and Mining

    Original Data
      -> Data Integration and Selection -> Target Data
      -> Preprocessing -> Preprocessed Data
      -> Model Construction -> Patterns
      -> Interpretation -> Knowledge

  • 7/30/2019 Very Good Minng

    34/301

    34

    Data Mining Techniques

    Descriptive
      Clustering
      Association
      Sequential Analysis

    Predictive
      Classification
        Decision Tree
        Rule Induction
        Neural Networks
        Nearest Neighbor Classification
      Regression

  • 7/30/2019 Very Good Minng

    35/301

    35

    Data Mining Models and Tasks

  • 7/30/2019 Very Good Minng

    36/301

    36

    Basic Data Mining Tasks

    Classification maps data into predefined groups or classes
      Supervised learning
      Pattern recognition
      Prediction

    Regression is used to map a data item to a real valued prediction variable.

    Clustering groups similar data together into clusters.
      Unsupervised learning
      Segmentation
      Partitioning
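    To make the clustering task concrete, here is a minimal one-dimensional k-means sketch. The slides do not prescribe this algorithm at this point; it is swapped in as one common way to group similar data without predefined classes. The spend values and the starting centers are made up.

```python
def kmeans_1d(values, centers, rounds=10):
    """Tiny k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its group."""
    centers = list(centers)
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

monthly_spend = [10, 12, 11, 95, 100, 105]  # hypothetical customer spend
centers, groups = kmeans_1d(monthly_spend, centers=[0, 50])
```

    No labels are supplied: the two segments (low and high spenders) emerge from the data alone, which is what makes clustering unsupervised.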

  • 7/30/2019 Very Good Minng

    37/301

    37

    Basic Data Mining Tasks (cont'd)

    Summarization maps data into subsets with associated simple descriptions.
      Characterization
      Generalization

    Link Analysis uncovers relationships among data.
      Affinity Analysis
      Association Rules

    Sequential Analysis determines sequential patterns.

  • 7/30/2019 Very Good Minng

    38/301

    38

    Ex: Time Series Analysis

    Example: Stock Market

    Predict future values

    Determine similar patterns over time

    Classify behavior

  • 7/30/2019 Very Good Minng

    39/301

    39

    Data Mining vs. KDD

    Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.

    Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.

  • 7/30/2019 Very Good Minng

    40/301

    40

    Data Mining Development

    Similarity Measures
    Hierarchical Clustering
    IR Systems
    Imprecise Queries
    Textual Data
    Web Search Engines

    Bayes Theorem
    Regression Analysis
    EM Algorithm
    K-Means Clustering
    Time Series Analysis

    Neural Networks
    Decision Tree Algorithms

    Algorithm Design Techniques
    Algorithm Analysis
    Data Structures

    Relational Data Model
    SQL
    Association Rule Algorithms
    Data Warehousing
    Scalability Techniques

  • 7/30/2019 Very Good Minng

    41/301

    41

    KDD Issues

    Human Interaction

    Overfitting

    Outliers

    Interpretation

    Visualization

    Large Datasets

    High Dimensionality

  • 7/30/2019 Very Good Minng

    42/301

    42

    KDD Issues (contd)

    Multimedia Data

    Missing Data

    Irrelevant Data

    Noisy Data

    Changing Data

    Integration

    Application

  • 7/30/2019 Very Good Minng

    43/301

    43

    Visualization Techniques

    Graphical

    Geometric

    Icon-based

    Pixel-based

    Hierarchical

    Hybrid

  • 7/30/2019 Very Good Minng

    44/301

    44

    Data Mining Applications

    Data Mining Applications:

  • 7/30/2019 Very Good Minng

    45/301

    45

    Data Mining Applications: Retail

    Performing basket analysis
      Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.

    Sales forecasting
      Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?

    Database marketing
      Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer label clothing or those who attend sales. This information can be used to focus cost-effective promotions.

    Merchandise planning and allocation
      When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic

  • 7/30/2019 Very Good Minng

    46/301

    46

    Data Mining Applications: Banking

    Card marketing
      By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.

    Cardholder pricing and profitability
      Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.

    Fraud detection
      Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.

    Predictive life-cycle management
      DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).

  • 7/30/2019 Very Good Minng

    47/301

    47

    Data Mining Applications: Telecommunication

    Call detail record analysis
      Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.

    Customer loyalty
      Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.

  • 7/30/2019 Very Good Minng

    48/301

    48

    Data Mining Applications: Other Applications

    Customer segmentation
      All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.

    Manufacturing
      Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.

    Warranties
      Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.

    Frequent flier incentives
      Airlines can identify groups of customers that can be given incentives to fly more.

  • 7/30/2019 Very Good Minng

    49/301

    49

    A producer wants to know...

    Which are our lowest/highest margin customers?

    Who are my customers and what products are they buying?

    Which customers are most likely to go to the competition?

    What impact will new products/services have on revenue and margins?

    What product promotions have the biggest impact on revenue?

    What is the most effective distribution channel?

  • 7/30/2019 Very Good Minng

    50/301

    50

    Data, Data everywhere, yet ...

    I can't find the data I need
      data is scattered over the network
      many versions, subtle differences

    I can't get the data I need
      need an expert to get the data

    I can't understand the data I found
      available data poorly documented

    I can't use the data I found
      results are unexpected
      data needs to be transformed from one form to other

  • 7/30/2019 Very Good Minng

    51/301

    51

    What is a Data Warehouse?

    A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

    [Barry Devlin]

  • 7/30/2019 Very Good Minng

    52/301

    52

    What are the users saying...

    Data should be integrated across the enterprise

    Summary data has a real value to the organization

    Historical data holds the key to understanding data over time

    What-if capabilities are required

  • 7/30/2019 Very Good Minng

    53/301

    53

    What is Data Warehousing?

    A process of transforming data into information and making it available to users in a timely enough manner to make a difference

    [Forrester Research, April 1996]

    Data -> Information

  • 7/30/2019 Very Good Minng

    54/301

    54

    Very Large Data Bases

    Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

    Petabytes -- 10^15 bytes: Geographic Information Systems

    Exabytes -- 10^18 bytes: National Medical Records

    Zettabytes -- 10^21 bytes: Weather images

    Yottabytes -- 10^24 bytes: Intelligence Agency Videos

  • 7/30/2019 Very Good Minng

    55/301

    55

    Data Warehousing -- It is a process

    Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible

    A decision support database maintained separately from the organization's operational database

  • 7/30/2019 Very Good Minng

    56/301

    56

    Data Warehouse

    A data warehouse is a

    subject-oriented

    integrated

    time-varying

    non-volatile

    collection of data that is used primarily in

    organizational decision making.

    -- Bill Inmon, Building the Data Warehouse 1996

  • 7/30/2019 Very Good Minng

    57/301

    Data Warehousing Concepts

    Decision support is key for companies wanting to turn their organizational data into an information asset

    Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support

    Data Warehouse: "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"

    OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications

  • 7/30/2019 Very Good Minng

    58/301

    What does data warehouse do?

    integrate diverse information from various systems, which enables users to quickly produce powerful ad-hoc queries and perform complex analysis

    create an infrastructure for reusing the data in numerous ways

    create an open systems environment to make useful information easily accessible to authorized users

    help managers make informed decisions

    58

  • 7/30/2019 Very Good Minng

    59/301

    Benefits of Data Warehousing

    Potential high returns on investment

    Competitive advantage

    Increased productivity of corporate decision-makers

  • 7/30/2019 Very Good Minng

    60/301

    Comparison of OLTP and Data Warehousing

    OLTP systems | Data warehousing systems
    Holds current data | Holds historic data
    Stores detailed data | Stores detailed, lightly summarized, and highly summarized data
    Data is dynamic | Data is largely static
    Repetitive processing | Ad hoc, unstructured, and heuristic processing
    High level of transaction throughput | Medium to low transaction throughput
    Predictable pattern of usage | Unpredictable pattern of usage
    Transaction driven | Analysis driven
    Application oriented | Subject oriented
    Supports day-to-day decisions | Supports strategic decisions
    Serves large number of clerical / operational users | Serves relatively lower number of managerial users

  • 7/30/2019 Very Good Minng

    61/301

    Data Warehouse Architecture

    Operational Data
    Load Manager
    Warehouse Manager
    Query Manager
    Detailed Data
    Lightly and Highly Summarized Data
    Archive / Backup Data
    Meta-Data
    End-user Access Tools

  • 7/30/2019 Very Good Minng

    62/301

    End-user Access Tools

    Reporting and query tools

    Application development tools

    Executive Information System (EIS) tools

    Online Analytical Processing (OLAP) tools

    Data mining tools

  • 7/30/2019 Very Good Minng

    63/301

    Data Warehousing Tools and Technologies

    Extraction, Cleansing, and Transformation Tools

    Data Warehouse DBMS
      Load performance
      Load processing
      Data quality management
      Query performance
      Terabyte scalability
      Networked data warehouse
      Warehouse administration
      Integrated dimensional tools
      Advanced query functionality

  • 7/30/2019 Very Good Minng

    64/301

    Data Marts

    A subset of a data warehouse that supports the requirements of a particular department or business function

  • 7/30/2019 Very Good Minng

    65/301

    Online Analytical Processing (OLAP)

    OLAP

    The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data

    Multi-dimensional OLAP

    Cubes of data

    [Figure: data cube with dimensions Time, City, Product type]
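    A data cube over Time, City, and Product type can be imitated with a flat list of fact rows and a roll-up function. This is a rough sketch only: the rows, the sales measure, and the `roll_up` helper are invented for illustration, and a real OLAP engine would precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (time, city, product_type, sales).
facts = [
    ("2010-Q1", "Riyadh", "Dairy", 100),
    ("2010-Q1", "Riyadh", "Bakery", 40),
    ("2010-Q1", "Jeddah", "Dairy", 70),
    ("2010-Q2", "Riyadh", "Dairy", 120),
    ("2010-Q2", "Jeddah", "Bakery", 30),
]
DIMENSIONS = ("time", "city", "product_type")

def roll_up(rows, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

by_city = roll_up(facts, keep=("city",))
by_time_and_type = roll_up(facts, keep=("time", "product_type"))
```

    Each choice of `keep` corresponds to viewing the cube along a different combination of dimensions.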

  • 7/30/2019 Very Good Minng

    66/301

    Problems of Data Warehousing

    Underestimation of resources for data loading
    Hidden problems with source systems
    Required data not captured
    Increased end-user demands
    Data homogenization
    High demand for resources
    Data ownership
    High maintenance
    Long duration projects
    Complexity of integration

  • 7/30/2019 Very Good Minng

    67/301

    Codd's Rules for OLAP

    Multi-dimensional conceptual view
    Transparency
    Accessibility
    Consistent reporting performance
    Client-server architecture
    Generic dimensionality
    Dynamic sparse matrix handling
    Multi-user support
    Unrestricted cross-dimensional operations
    Intuitive data manipulation
    Flexible reporting
    Unlimited dimensions and aggregation levels

  • 7/30/2019 Very Good Minng

    68/301

    OLAP Tools

    Multi-dimensional OLAP (MOLAP)

    Multi-dimensional DBMS (MDDBMS)

    Relational OLAP (ROLAP)

    Creation of multiple multi-dimensional views of the two-dimensional relations

    Managed Query Environment (MQE)

    Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally

  • 7/30/2019 Very Good Minng

    69/301

    Data Mining

    Definition

    The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions

    Knowledge discovery: association rules, sequential patterns, classification trees

    Goals: prediction, identification, classification, optimization

  • 7/30/2019 Very Good Minng

    70/301

    Data Mining Techniques

    Predictive Modeling

    Supervised training with two phases

      Training phase: building a model using a large sample of historical data called the training set

      Testing phase: trying the model on new data

    Database Segmentation

    Link Analysis

    Deviation Detection
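    The two phases of predictive modeling can be illustrated with a very small classifier. This sketch assumes a 1-nearest-neighbour model, chosen only because it is short; the credit-risk records, field meanings, and the 6/2 split are all made up.

```python
# Hypothetical labelled history: (income_k, debt_k) -> credit risk label.
history = [
    ((80, 5), "good"), ((75, 10), "good"), ((90, 8), "good"),
    ((30, 25), "poor"), ((25, 30), "poor"), ((35, 28), "poor"),
    ((85, 6), "good"), ((28, 27), "poor"),
]
training_set, test_set = history[:6], history[6:]

def predict(model, point):
    """1-nearest-neighbour: return the label of the closest training example."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda ex: dist2(ex[0], point))[1]

# Training phase: for nearest neighbour, the "model" is simply the stored training set.
model = list(training_set)

# Testing phase: measure accuracy on examples the model has not seen.
correct = sum(predict(model, x) == label for x, label in test_set)
accuracy = correct / len(test_set)
```

    Keeping the test examples out of the training set is the point of the second phase: accuracy measured on the training data itself would say little about new applicants.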

  • 7/30/2019 Very Good Minng

    71/301

    What are Data Mining Tasks?

    Classification

    Regression

    Clustering

    Summarization

    Dependency modeling

    Change and Deviation Detection

    71

  • 7/30/2019 Very Good Minng

    72/301

    What are Data Mining Discoveries?

    New Purchase Trends

    Plan Investment Strategies

    Detect Unauthorized Expenditure

    Fraudulent Activities

    Crime Trends

    Smugglers-border crossing

    72

  • 7/30/2019 Very Good Minng

    73/301

    73

    Data Warehouse Architecture

    Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems

    Extraction and Cleansing -> Optimized Loader -> Data Warehouse Engine -> Analyze / Query

    Metadata Repository

  • 7/30/2019 Very Good Minng

    74/301

    74

    Data Warehouse for Decision Support & OLAP

    Putting information technology to work to help the knowledge worker make faster and better decisions

    Which of my customers are most likely to go to the competition?

    What product promotions have the biggest impact on revenue?

    How did the share price of software companies correlate with profits over the last 10 years?

  • 7/30/2019 Very Good Minng

    75/301

    75

    Decision Support

    Used to manage and control business

    Data is historical or point-in-time

    Optimized for inquiry rather than update

    Use of the system is loosely defined and can be ad-hoc

    Used by managers and end-users to understand the business and make judgements

  • 7/30/2019 Very Good Minng

    76/301

    76

    Data Mining works with Warehouse Data

    Data Warehousing provides the Enterprise with a memory

    Data Mining provides the Enterprise with intelligence

  • 7/30/2019 Very Good Minng

    77/301

    77

    We want to know ...

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

    Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

    If I raise the price of my product by Rs. 2, what is the effect on my ROI?

    If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

    If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

    Which of my customers are likely to be the most loyal?

    Data Mining helps extract such information

  • 7/30/2019 Very Good Minng

    78/301

    78

    Application Areas

    Industry | Application
    Finance | Credit card analysis
    Insurance | Claims, fraud analysis
    Telecommunication | Call record analysis
    Transport | Logistics management
    Consumer goods | Promotion analysis
    Data service providers | Value added data
    Utilities | Power usage analysis

  • 7/30/2019 Very Good Minng

    79/301

    79

    Data Mining in Use

    The US Government uses Data Mining to track fraud

    A Supermarket becomes an information broker

    Basketball teams use it to track game strategy

    Cross Selling

    Warranty Claims Routing

    Holding on to Good Customers

    Weeding out Bad Customers

  • 7/30/2019 Very Good Minng

    80/301

    80

    What makes data mining possible?

    "Advances in the following areas are making data mining deployable:

    data warehousing

    better and more data (i.e., operational, behavioral, and demographic)

    the emergence of easily deployed data mining tools and

    the advent of new data mining techniques."
    -- Gartner Group

  • 7/30/2019 Very Good Minng

    81/301

    81

    Why Separate Data Warehouse?

    Performance

    Operational databases are designed and tuned for known transactions and workloads.

    Complex OLAP queries would degrade performance for operational transactions.

    Special data organization, access and implementation methods are needed for multidimensional views and queries.

    Function

    Missing data: Decision support requires historical data, which operational databases do not typically maintain.

    Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.

    Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.

  • 7/30/2019 Very Good Minng

    82/301

    82

    What are Operational Systems?

    They are OLTP systems

    Run mission critical applications

    Need to work with stringent performance requirements for routine tasks

    Used to run a business!

  • 7/30/2019 Very Good Minng

    83/301

    83

    RDBMS used for OLTP

    Database Systems have been used traditionally for OLTP

    clerical data processing tasks

    detailed, up to date data

    structured repetitive tasks

    read/update a few records

    isolation, recovery and integrity are critical

  • 7/30/2019 Very Good Minng

    84/301

    84

    Operational Systems

    Run the business in real time

    Based on up-to-the-second data

    Optimized to handle large numbers of simple read/write transactions

    Optimized for fast response to predefined transactions

    Used by people who deal with customers, products -- clerks, salespeople etc.

    They are increasingly used by customers

  • 7/30/2019 Very Good Minng

    85/301

    85

    Examples of Operational Data

    Data | Industry | Usage | Technology | Volumes
    Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
    Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
    Point-of-Sale data | Retail | Generate bills, manage stock | ERP, client/server, relational databases | Very large
    Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large
    Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

    Application-Orientation vs.

  • 7/30/2019 Very Good Minng

    86/301

    86

    ppSubject-Orientation

    Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

    Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity


    OLTP vs. Data Warehouse

    OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse

    Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)

    e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
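The example query above filters on three dimensions at once (city, time of day, month). A minimal sketch of it, assuming a hypothetical `calls` fact table in an in-memory SQLite database; the schema, the sample rows and the function name are invented for illustration and are not from the slides:

```python
import sqlite3

# Hypothetical fact table; schema and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2009-12-01 10:15:00", 12.0),
        ("Pune", "2009-12-14 16:40:00", 8.0),
        ("Pune", "2009-12-14 20:05:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2009-11-30 11:00:00", 50.0),    # not December
        ("Mumbai", "2009-12-02 09:30:00", 99.0),  # other city
    ],
)

def average_call_amount(city, month, start_hour, end_hour):
    """Average amount for calls in `city` during `month`, starting in [start_hour, end_hour)."""
    row = conn.execute(
        """
        SELECT AVG(amount) FROM calls
        WHERE city = ?
          AND strftime('%m', call_start) = ?
          AND CAST(strftime('%H', call_start) AS INTEGER) >= ?
          AND CAST(strftime('%H', call_start) AS INTEGER) < ?
        """,
        (city, month, start_hour, end_hour),
    ).fetchone()
    return row[0]

pune_december_daytime = average_call_amount("Pune", "12", 9, 17)
```

On an OLTP schema this kind of query scans many rows and is not one of the predefined transactions, which is the point the slide makes about warehouse workloads.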


    OLTP vs Data Warehouse

    OLTP

    Application oriented

    Used to run business

    Detailed data

    Current, up to date

    Isolated data

    Repetitive access

    Clerical user

    Warehouse (DSS)

    Subject oriented

    Used to analyze business

    Summarized and refined

    Snapshot data

    Integrated data

    Ad-hoc access

    Knowledge user (manager)


    OLTP vs Data Warehouse

    OLTP

    Performance sensitive

    Few records accessed at a time (tens)

    Read/update access

    No data redundancy

    Database size 100 MB - 100 GB

    Data Warehouse

    Performance relaxed

    Large volumes accessed at a time (millions)

    Mostly read (batch update)

    Redundancy present

    Database size 100 GB - few terabytes


    OLTP vs Data Warehouse

    OLTP

    Transaction throughput is the performance metric

    Thousands of users

    Managed in entirety

    Data Warehouse

    Query throughput is the performance metric

    Hundreds of users

    Managed by subsets


    To summarize ...

    OLTP systems are used to run a business

    The data warehouse helps to optimize the business


    Why Now?

    Data is being produced

    ERP provides clean data

    The computing power is available

    The computing power is affordable

    The competitive pressures are strong

    Commercial products are available

    Myths surrounding OLAP Servers and Data Marts

    Data marts and OLAP servers are departmental solutions supporting a handful of users

    Million-dollar massively parallel hardware is needed to deliver fast response times for complex queries

    OLAP servers require massive and unwieldy indices

    Complex OLAP queries clog the network with data

    Data warehouses must be at least 100 GB to be effective

    Source -- Arbor Software Home Page


    II. On-Line Analytical Processing (OLAP)

    Making Decision

    Support Possible


    Typical OLAP Queries

    Write a multi-table join to compare sales for each product line YTD this year vs. last year.

    Repeat the above process to find the top 5 product contributors to margin.

    Repeat the above process to find the sales of a product line to new vs. existing customers.

    Repeat the above process to find the customers that have had negative sales growth.


    * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

    What Is OLAP?

    Online Analytical Processing: term coined by E. F. Codd in a 1994 paper contracted by Arbor Software*

    Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System

    OLAP = Multidimensional Database

    MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)

    ROLAP: Relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)


    The OLAP Market

    Rapid growth in the enterprise market

    1995: $700 Million

    1997: $2.1 Billion

    Significant consolidation activity among major DBMS vendors

    10/94: Sybase acquires ExpressWay

    7/95: Oracle acquires Express

    11/95: Informix acquires Metacube

    1/97: Arbor partners up with IBM

    10/96: Microsoft acquires Panorama

    Result: OLAP shifted from small vertical niche to mainstream DBMS category


    Strengths of OLAP

    It is a powerful visualization paradigm

    It provides fast, interactive response times

    It is good for analyzing time series

    It can be useful to find some clusters and outliers

    Many vendors offer OLAP tools


    Nigel Pendse, Richard Creath - The OLAP Report

    OLAP Is FASMI

    Fast

    Analysis

    Shared

    Multidimensional

    Information


    Multi-dimensional Data

    "Hey, I sold $100M worth of goods"

    [Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1-7)]

    Dimensions: Product, Region, Time

    Hierarchical summarization paths

    Product: Industry > Category > Product

    Region: Country > Region > City > Office

    Time: Year > Quarter > Month or Week > Day


    A Visual Operation: Pivot (Rotate)

    [Figure: sales cross-tab of Product (Juice, Cola, Milk, Cream) by Date (3/1, 3/2, 3/3, 3/4); sample cell values 10, 47, 30, 12]


    Slicing and Dicing

    [Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and Region (India, Far East, Europe); the Telecomm slice is highlighted]


    Roll-up and Drill Down

    [Figure: hierarchy running from higher level of aggregation to low-level details: Sales Channel, Region, Country, State, Location Address, Sales Representative]
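Roll-up and drill-down move along such a hierarchy: rolling up sums the detail rows at a coarser level, drilling down returns to a finer one. A minimal sketch, assuming a three-level hierarchy and invented sales figures (the rows, `LEVELS` and `roll_up` are illustrative names, not from the slides):

```python
from collections import defaultdict

# Hypothetical detail rows: (region, country, state, sales amount).
SALES = [
    ("Asia", "India", "Maharashtra", 120),
    ("Asia", "India", "Karnataka", 80),
    ("Asia", "Japan", "Tokyo", 200),
    ("Europe", "France", "Ile-de-France", 150),
]
LEVELS = ("region", "country", "state")

def roll_up(rows, level):
    """Sum sales at the given hierarchy level (a coarser level means more aggregation)."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for *path, amount in rows:
        totals[tuple(path[:depth])] += amount
    return dict(totals)

by_region = roll_up(SALES, "region")     # highest level of aggregation
by_country = roll_up(SALES, "country")   # one drill-down step
```

Drilling down from `by_region` to `by_country` only changes the grouping key; the underlying detail rows stay the same.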


    Results of Data Mining Include:

    Forecasting what may happen in the future

    Classifying people or things into groups by recognizing patterns

    Clustering people or things into groups based on their attributes

    Associating what events are likely to occur together

    Sequencing what events are likely to lead to later events


    Data mining is not

    Brute-force crunching of bulk data

    Blind application of algorithms

    Going to find relationships where none exist

    Presenting data in different ways

    A database-intensive task

    A difficult-to-understand technology requiring an advanced degree in computer science


    Data Mining versus OLAP

    OLAP - On-line Analytical Processing

    Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening

    Data Mining Versus Statistical Analysis

    Data Mining

    Originally developed to act as expert systems to solve problems

    Less interested in the mechanics of the technique

    If it makes sense, then let's use it

    Does not require assumptions to be made about data

    Can find patterns in very large amounts of data

    Requires understanding of data and business problem

    Data Analysis

    Tests for statistical correctness of models

    Are statistical assumptions of models correct? E.g., is the R-square good?

    Hypothesis testing

    Is the relationship significant? Use a t-test to validate significance

    Tends to rely on sampling

    Techniques are not optimised for large amounts of data

    Requires strong statistical skills

    Examples of What People are Doing with Data Mining:

    Fraud/Non-Compliance Anomaly detection

    Isolate the factors that lead to fraud, waste and abuse

    Target auditing and investigative efforts more effectively

    Credit/Risk Scoring

    Intrusion detection

    Parts failure prediction

    Recruiting/Attracting customers

    Maximizing profitability (cross selling, identifying profitable customers)

    Service Delivery and Customer Retention

    Build profiles of customers likely to use which services

    Web Mining

  • 7/30/2019 Very Good Minng

    109/301

    What data mining has done for...

    Scheduled its workforce

    to provide faster, more accurateanswers to questions.

    The US Internal Revenue Service

    needed to improve customerservice and...

  • 7/30/2019 Very Good Minng

    110/301

    What data mining has done for...

    analyzed suspects cell phoneusage to focus investigations.

    The US Drug Enforcement

    Agency needed to be more

    effective in their drug bustsand

  • 7/30/2019 Very Good Minng

    111/301

    What data mining has done for...

    Reduced direct mail costs by 30%

    while garnering 95% of the

    campaigns revenue.

    HSBC need to cross-sell more

    effectively by identifying profiles

    that would be interested in higheryielding investments and...

    Suggestion:Predicting Washington

  • 7/30/2019 Very Good Minng

    112/301

    Suggestion:Predicting Washington

    C-SPAN has launched a digital archive of 500,000 hours of audio debates.

    Text mining or audio mining of these talks could reveal answers to certain questions, such as...

    Example Application: Sports

  • 7/30/2019 Very Good Minng

    113/301

    Example Application: Sports

    IBM Advanced Scout analyzesNBA game statistics

    Shots blocked

    Assists

    Fouls

    Google: IBM Advanced Scout

    Advanced Scout

  • 7/30/2019 Very Good Minng

    114/301

    Advanced Scout

    Example pattern: An analysis of thedata from a game played betweenthe New York Knicks and the CharlotteHornets revealed that When Glenn Rice

    played the shooting guard position, heshot 5/6 (83%) on jump shots."

    Pattern is interesting:The average shooting percentage for theCharlotte Hornets during that game was54%.

    Data Mining: Types of Data

  • 7/30/2019 Very Good Minng

    115/301

    Data Mining: Types of Data

    Relational data and transactional dataSpatial and temporal data, spatio-

    temporal observations

    Time-series data

    Text

    Images, video

    Mixtures of data

    Sequence data

    Features from processing other datasources

    Data Mining Techniques

  • 7/30/2019 Very Good Minng

    116/301

    Data Mining Techniques

    Supervised learning

    Classification and regression

    Unsupervised learning

    Clustering

    Dependency modeling

    Associations, summarization, causality

    Outlier and deviation detection

    Trend analysis and change detection

    Different Types of Classifiers

  • 7/30/2019 Very Good Minng

    117/301

    Different Types of Classifiers

    Linear discriminant analysis (LDA)

    Quadratic discriminant analysis (QDA)

    Density estimation methods

    Nearest neighbor methods

    Logistic regression

    Neural networks

    Fuzzy set theory

    Decision trees


    Test Sample Estimate

    Divide D into D1 and D2

    Use D1 to construct the classifier d

    Then use the resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d

    Unbiased and efficient, but removes D2 from training dataset D


    V-fold Cross Validation

    Procedure:

    Construct classifier d from D

    Partition D into V datasets D1, ..., DV

    Construct classifier di using D \ Di

    Calculate the estimated misclassification error R(di,Di) of di using test sample Di

    Final misclassification estimate:

    Weighted combination of individual misclassification errors: $R(d, D) = \frac{1}{V} \sum_{i=1}^{V} R(d_i, D_i)$
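A minimal sketch of the V-fold procedure, again with a toy majority-label classifier standing in for a real learner. The fold construction by striding and all names are assumptions; the final estimate is the plain average of the per-fold errors, which matches the formula when the folds have equal size:

```python
def majority_label(records):
    """Toy learner: the 'classifier' is just the most common training label."""
    labels = [label for _, label in records]
    return max(sorted(set(labels)), key=labels.count)

def error_rate(predicted_label, records):
    return sum(1 for _, label in records if label != predicted_label) / len(records)

def v_fold_estimate(dataset, v):
    folds = [dataset[i::v] for i in range(v)]            # D1, ..., DV
    fold_errors = []
    for i, test_fold in enumerate(folds):
        # Train d_i on D \ D_i, then test it on D_i.
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        d_i = majority_label(training)
        fold_errors.append(error_rate(d_i, test_fold))
    return sum(fold_errors) / v                          # (1/V) * sum of R(d_i, D_i)

DATA = [(i, "no" if i % 5 == 0 else "yes") for i in range(100)]
cv_error = v_fold_estimate(DATA, v=4)
```

Every record is used for testing exactly once and for training V-1 times, at the price of building V classifiers.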

    Cross-Validation: Example

  • 7/30/2019 Very Good Minng

    120/301

    Cross-Validation: Example

    d

    d1

    d2

    d3

    Cross-Validation

  • 7/30/2019 Very Good Minng

    121/301

    Cross-Validation

    Misclassification estimate obtainedthrough cross-validation is usuallynearly unbiased

    Costly computation (we need tocompute d, and d1, , dV);computation of di is nearly asexpensive as computation of d

    Preferred method to estimate qualityof learning algorithms in themachine learning literature

    Decision Tree Construction

  • 7/30/2019 Very Good Minng

    122/301

    Decision Tree Construction

    Three algorithmic components:Split selection (CART, C4.5, QUEST,

    CHAID, CRUISE, )

    Pruning (direct stopping rule, testdataset pruning, cost-complexitypruning, statistical tests, bootstrapping)

    Data access (CLOUDS, SLIQ, SPRINT,RainForest, BOAT, UnPivot operator)


    Goodness of a Split

    Consider node t with impurity phi(t)

    The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is:

    $\Delta\phi(s, t) = \phi(t) - p_L\,\phi(t_L) - p_R\,\phi(t_R)$

    where pL and pR are the fractions of the records at t that go to tL and tR
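The impurity reduction can be computed directly once an impurity function is chosen. The sketch below assumes the Gini index as phi (the slide does not fix a particular impurity measure) and uses invented label lists:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right, impurity=gini):
    """phi(t) - pL * phi(tL) - pR * phi(tR) for a split of `parent` into `left` and `right`."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return impurity(parent) - p_left * impurity(left) - p_right * impurity(right)

parent = ["a"] * 5 + ["b"] * 5
perfect = impurity_reduction(parent, ["a"] * 5, ["b"] * 5)          # classes fully separated
useless = impurity_reduction(parent, ["a", "b"] * 2, ["a", "b"] * 3)  # children mirror the parent
```

A split selection method would evaluate this quantity for each candidate predicate s and keep the one with the largest reduction.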

    Pruning Methods

  • 7/30/2019 Very Good Minng

    124/301

    Pruning Methods

    Test dataset pruning

    Direct stopping rule

    Cost-complexity pruning

    MDL pruning

    Pruning by randomization testing

    Stopping Policies

  • 7/30/2019 Very Good Minng

    125/301

    Stopping Policies

    A stopping policy indicates when furthergrowth of the tree at a node t iscounterproductive.

    All records are of the same class

    The attribute values of all records areidentical

    All records have missing values

    At most one class has a number ofrecords larger than a user-specifiednumber

    All records go to the same child node if t

    is split (only possible with some split

    Test Dataset Pruning

  • 7/30/2019 Very Good Minng

    126/301

    Test Dataset Pruning

    Use an independent test sample D' to estimate the misclassification cost using the resubstitution estimate R(T,D') at each node

    Select the subtree T' of T with the smallest expected cost

    Missing Values

  • 7/30/2019 Very Good Minng

    127/301

    Missing Values

    What is the problem?

    During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems)

    But if a record r misses the value of the variable in the splitting attribute, r cannot participate further in tree construction

    Algorithms for missing values address this problem


    Mean and Mode Imputation

    Assume record r has missing value r.X, and splitting variable is X.

    Simplest algorithm:

    If X is numerical (categorical), impute the overall mean (mode)

    Improved algorithm:

    If X is numerical (categorical), impute the mean(X|t.C) (the mode(X|t.C))
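Both variants can be sketched in a few lines. This sketch assumes that t.C denotes the class label, so the improved algorithm takes the mean or mode only over records with the same class as r; the record layout and the `impute` helper are hypothetical:

```python
from statistics import mean, mode

def impute(records, attribute, class_key=None):
    """Fill None values of `attribute` with the mean (numeric) or mode (categorical).

    If class_key is given, the statistic is computed only over records
    that share the incomplete record's class label.
    """
    def fill_value(record):
        pool = [r[attribute] for r in records
                if r[attribute] is not None
                and (class_key is None or r[class_key] == record[class_key])]
        numeric = all(isinstance(v, (int, float)) for v in pool)
        return mean(pool) if numeric else mode(pool)

    return [dict(r, **{attribute: fill_value(r)}) if r[attribute] is None else dict(r)
            for r in records]

ROWS = [
    {"age": 20, "cls": "A"},
    {"age": 30, "cls": "A"},
    {"age": 70, "cls": "B"},
    {"age": None, "cls": "A"},
]
overall = impute(ROWS, "age")                     # simplest: overall mean
per_class = impute(ROWS, "age", class_key="cls")  # improved: mean within class A
```

The sketch raises an error if no record has a usable value to average; a real implementation would need a fallback for that case.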

    Decision Trees: Summary

  • 7/30/2019 Very Good Minng

    129/301

    Decision Trees: Summary

    Many application of decision treesThere are many algorithms available for:Split selection

    Pruning

    Handling Missing Values

    Data Access

    Decision tree construction still activeresearch area (after 20+ years!)

    Challenges: Performance, scalability,evolving datasets, new applications

    Supervised vs Unsupervised Learning

  • 7/30/2019 Very Good Minng

    130/301

    Supervised vs. Unsupervised Learning

    Supervised y=F(x): true function

    D: labeled training set

    D: {xi,F(xi)}

    Learn:G(x): model trained topredict labels D

    Goal:E[(F(x)-G(x))2] 0

    Well defined criteria:Accuracy, RMSE, ...

    UnsupervisedGenerator: true model

    D: unlabeled datasample

    D: {xi}

    Learn

    ??????????

    Goal:

    ??????????

    Well defined criteria:

    ??????????

    Clustering: Unsupervised Learning

  • 7/30/2019 Very Good Minng

    131/301

    Clustering Unsupervised Learning

    Given:Data Set D (training set)

    Similarity/distance metric/information

    Find:Partitioning of data

    Groups of similar/close items

    Similarity?

  • 7/30/2019 Very Good Minng

    132/301

    Similarity?

    Groups of similar customersSimilar demographics

    Similar buying behavior

    Similar health

    Similar products

    Similar cost

    Similar function

    Similar store

    Similarity usually is domain/problemspecific

    Clustering: Informal ProblemDefinition

  • 7/30/2019 Very Good Minng

    133/301

    Definition

    Input:A data set ofNrecords each given as a d-

    dimensional data feature vector.

    Output:

    Determine a natural, useful partitioningof the data set into a number of (k)clusters and noise such that we have:High similarity of records within each cluster

    (intra-cluster similarity)

    Low similarity of records between clusters(inter-cluster similarity)

    Types of Clustering

  • 7/30/2019 Very Good Minng

    134/301

    ypes of Cluster ng

    Hard Clustering:Each object is in one and only one

    cluster

    Soft Clustering:Each object has a probability of being

    in each cluster

    Clustering Algorithms

  • 7/30/2019 Very Good Minng

    135/301

    ust r ng gor thms

    Partitioning-based clusteringK-means clustering

    K-medoids clustering

    EM (expectation maximization) clustering

    Hierarchical clustering

    Divisive clustering (top down)

    Agglomerative clustering (bottom up)

    Density-Based MethodsRegions of dense points separated by sparser

    regions of relatively low density

    K-Means Clustering Algorithm

    Initialize k cluster centers

    Do

    Assignment step: Assign each data point to its closest cluster center

    Re-estimation step: Re-compute cluster centers

    While (there are still changes in the cluster centers)
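The loop above can be sketched as follows, assuming squared Euclidean distance, caller-supplied initial centers and a cap on the number of iterations; these choices and the sample points are assumptions, not part of the slide:

```python
def k_means(points, centers, max_iterations=100):
    """K-means on tuples of floats, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Re-estimation step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:   # no change in the centers: stop
            break
        centers = new_centers
    return centers, clusters

POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centers, final_clusters = k_means(POINTS, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centers, which is one of the open issues the deck raises next.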

    Visualization at:

    http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html

    Issues


    Why is K-Means working: How does it find the cluster centers?

    Does it find an optimal clustering

    What are good starting points for the algorithm?

    What is the right number of cluster centers?

    How do we know it will terminate?

    Agglomerative Clustering

    Algorithm:

    Put each item in its own cluster (all singletons)

    Find all pairwise distances between clusters

    Merge the two closest clusters

    Repeat until everything is in one cluster

    Observations:

    Results in a hierarchical clustering

    Yields a clustering for each possible number of clusters

    Greedy clustering: Result is not optimal for any cluster size
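A minimal sketch of the merge loop, assuming one-dimensional points and single-link distance (the closest pair of members); the slide does not specify a linkage, so that choice and the function names are illustrative:

```python
def agglomerative(points, linkage_distance):
    """Return the sequence of clusterings, from all singletons down to one cluster."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the given linkage.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: linkage_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

def single_link(c1, c2):
    """Distance between the two closest members of the clusters."""
    return min(abs(x - y) for x in c1 for y in c2)

levels = agglomerative([1.0, 2.0, 6.0, 7.0, 20.0], single_link)
```

`levels` holds one clustering per possible number of clusters, which is the hierarchy the observations refer to.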

    Density-Based Clustering

  • 7/30/2019 Very Good Minng

    139/301

    y g

    A cluster is defined as a connected densecomponent.

    Density is defined in terms of number ofneighbors of a point.

    We can find clusters of arbitrary shape

    Market Basket Analysis

    Consider shopping cart filled withseveral items

    Market basket analysis tries to

    answer the following questions:Who makes purchases?

    What do customers buy together?

    In what order do customers purchaseitems?

    Market Basket Analysis

    Given:

    A database of customer transactions

    Each transaction is a set of items

    Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

    TID | CID | Date | Item | Qty

    111 | 201 | 5/1/99 | Pen | 2

    111 | 201 | 5/1/99 | Ink | 1

    111 | 201 | 5/1/99 | Milk | 3

    111 | 201 | 5/1/99 | Juice | 6

    112 | 105 | 6/3/99 | Pen | 1

    112 | 105 | 6/3/99 | Ink | 1

    112 | 105 | 6/3/99 | Milk | 1

    113 | 106 | 6/5/99 | Pen | 1

    113 | 106 | 6/5/99 | Milk | 1

    114 | 201 | 7/1/99 | Pen | 2

    114 | 201 | 7/1/99 | Ink | 2

    114 | 201 | 7/1/99 | Juice | 4

    Market Basket Analysis (Contd.)

    Co-occurrences

    80% of all customers purchase items X, Y and Z together.

    Association rules

    60% of all customers who purchase X and Y also buy Z.

    Sequential patterns

    60% of customers who first buy X also purchase Y within three weeks.

    Confidence and Support

    We prune the set of all possible association rules using two interestingness measures:

    Confidence of a rule: $X \Rightarrow Y$ has confidence c if P(Y|X) = c

    Support of a rule: $X \Rightarrow Y$ has support s if P(XY) = s

    We can also define

    Support of an itemset (a co-occurrence) XY:
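Both measures can be computed by counting. The sketch below uses the four transactions from the earlier example table (TIDs 111 to 114), reduced to item sets with quantities ignored; the function names are illustrative:

```python
TRANSACTIONS = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in TRANSACTIONS.values() if itemset <= items)
    return hits / len(TRANSACTIONS)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of the combined itemset over support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

pen_ink_support = support({"Pen", "Ink"})       # in 3 of the 4 transactions
ink_to_juice = confidence({"Ink"}, {"Juice"})   # 2 of the 3 transactions with Ink
```

Pruning then amounts to keeping only the rules whose support and confidence both exceed user-chosen thresholds.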

    Market Basket Analysis:Applications

  • 7/30/2019 Very Good Minng

    144/301

    pp

    Sample ApplicationsDirect marketing

    Fraud detection for medical insurance

    Floor/shelf planningWeb site layout

    Cross-selling

    Applications of Frequent Itemsets

  • 7/30/2019 Very Good Minng

    145/301

    pp q

    Market Basket Analysis

    Association Rules

    Classification (especially: text, rare

    classes)

    Seeds for construction of BayesianNetworks

    Web log analysis

    Collaborative filtering

    Association Rule Algorithms

  • 7/30/2019 Very Good Minng

    146/301

    g

    More abstract problem redux

    Breadth-first search

    Depth-first search

    Problem Redux


    Abstract:

    A set of items {1,2,...,k}

    A database of transactions (itemsets) D = {T1, T2, ..., Tn}, Tj subset of {1,2,...,k}

    GOAL: Find all itemsets that appear in at least x transactions

    ("appear in" == "are subsets of")

    I subset of T: T supports I

    For an itemset I, the number of transactions it appears in is called the support of I.

    Concrete:

    I = {milk, bread, cheese, ...}

    D = {{milk,bread,cheese}, {bread,cheese,juice}, ...}

    GOAL: Find all itemsets that appear in at least 1000 transactions

    {milk,bread,cheese} supports {milk,bread}

    Problem Redux (Contd.)


    Definitions:

    An itemset is frequent if it is a subset of at least x transactions. (FI.)

    An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)

    GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).

    Obvious relationship: MFI subset of FI

    Example:

    D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }

    Minimum support x = 3

    {1,2} is frequent

    {1,2,3} is maximal frequent

    Support({1,2}) = 4

    All maximal frequent itemsets: {1,2,3}
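The example can be checked by brute force. This sketch enumerates every candidate itemset, which is only practical for tiny item sets; it is not the breadth-first or depth-first search the deck mentions, and the function names are illustrative:

```python
from itertools import combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]

def support(itemset, transactions):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_support):
    """All non-empty itemsets that appear in at least min_support transactions (FI)."""
    items = sorted(set().union(*transactions))
    return [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(set(c), transactions) >= min_support
    ]

def maximal_frequent_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset (MFI)."""
    return [i for i in frequent if not any(i < other for other in frequent)]

FI = frequent_itemsets(D, min_support=3)
MFI = maximal_frequent_itemsets(FI)
```

Here FI contains seven itemsets while MFI contains only {1,2,3}, which shows how much more compact the maximal representation is.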

    Applications

  • 7/30/2019 Very Good Minng

    149/301

    Spatial association rules

    Web mining

    Market basket analysis

    User/customer profiling

    ExtenSuggestionssions: SequentialPatterns

  • 7/30/2019 Very Good Minng

    150/301

    In the market itemset analysis, replace Milk, Pen, etc. with names of medications and use the idea in the new hospital data mining proposal

    The idea of swarm intelligence: add to it the extra analysis of the induction rules in this set of slides.

    Kraft Foods: Direct Marketing

  • 7/30/2019 Very Good Minng

    151/301

    Kraft Foods: Direct Marketing

    Company maintains a large database of purchases by customers.

    Data mining1. Analysts identified associations among groups of products

    bought by particular segments of customers.

    2. Sent out 3 sets of coupons to various households.

    Better response rates: 50% increase in sales for one of its products

    Continues to use this approach

    Health Insurance Commission of Australia: Insurance Fraud

    Commission maintains a database of insurance claims,includinglaboratory tests ordered during the diagnosis of patients.

    Data mining

    1. Identified the practice of "up coding" to reflect moreexpensive tests than are necessary.

    2. Now monitors orders for lab tests.

    Commission expects to save US$1,000,000 / year byeliminating the practice of "up coding.

    HNC Software: Credit Card Fraud

  • 7/30/2019 Very Good Minng

    152/301

    Payment Fraud

    Large issuers of cards may lose

    $10 million / year due to fraud

    Difficult to identify the few transactions among thousands which

    reflect potential fraud

    Falcon software

    Mines data through neural networks

    Introduced in September 1992

    Models each cardholder's requested transaction against the customer's

    past spending history.

    processes several hundred requests per second

    compares current transaction with customer's history

    identifies the transactions most likely to be frauds

    enables bank to stop high-risk transactions before they are

    authorized

    Used by many retail banks: currently monitors

    160 million card accounts for fraud

    New Account Fraud

  • 7/30/2019 Very Good Minng

    153/301

    New Account Fraud

    Fraudulent applications for credit cards are growing at 50 %

    per year

    Falcon Sentry software

    Mines data through neural networks and a rule baseIntroduced in September 1992

    Checks information on applications against data from

    credit bureaus

    Allows card issuers to simultaneously:

    increase the proportion of applications received

    reduce the proportion of fraudulent applications

    authorized

    Quality Control

  • 7/30/2019 Very Good Minng

    154/301

    y

    IBM Microelectronics: Quality Control Analyzed manufacturing data on Dynamic Random Access Memory

    (DRAM) chips.

    Data mining

    1. Built predictive models of

    manufacturing yield (% non-defective)

    effects of production parameters on chip performance.

    2. Discovered critical factors behind

    production yield &

    product performance.3. Created a new design for the chip

    increased yield saved millions of dollars in direct

    manufacturing costs

    enhanced product performance by substantially lowering the

    memory cycle time

    Retail Sales

  • 7/30/2019 Very Good Minng

    155/301

    B & L Stores

    Belk and Leggett Stores =

    one of largest retail chains

    280 stores in southeast U.S.

    data warehouse contains 100s of gigabytes (billioncharacters) of data

    data mining to:

    increase sales

    reduce costs

    Selected DSS Agent from MicroStrategy, Inc.

    analyize merchandizing (patterns of sales)

    manage inventory

    Market Basket Analysis

  • 7/30/2019 Very Good Minng

    156/301

    DSS Agent

    uses intelligent agents data mining

    provides multiple functions

    recognizes sales patterns among stores

    discovers sales patterns by

    time of day day of year

    category of product

    etc.

    swiftly identifies trends & shifts in customer tastes

    performs Market Basket Analysis (MBA)

    analyzes Point-of-Sale or -Service (POS) data

    identifies relationships among products and/or services purchased

    E.g. A customer who buys Brand X slacks has a 35% chance of

    buying Brand Y shirts.

    Agent tool is also used by other Fortune 1000 firms

    average ROI > 300 %

    Case Based Reasoning

    (CBR)

  • 7/30/2019 Very Good Minng

    157/301

    (CBR)

    case A targetcase B

    General scheme for a case based reasoning (CBR) model. The target cas

    matched against similar precedents in the historical database, such as cas

    Case Based Reasoning (CBR)

  • 7/30/2019 Very Good Minng

    158/301

    Learning through the accumulation of experience

    Key issues

    Indexing:storing cases for quick, effective access of precedents

    Retrieval:accessing the appropriate precedent cases

    Advantages

    Explicit knowledge form recognizable to humans

    No need to re-code knowledge for computer processing

    Limitations

    Retrieving precedents based on superficial featuresE.g. Matching Indonesia with U.S. because both have similar population size

    Traditional approach ignores the issue of generalizing knowledge

    Genetic Algorithm


    Generation of candidate solutions using the procedures of biological evolution.

    Procedure

    0. Initialize. Create a population of potential solutions ("organisms").

    1. Evaluate. Determine the level of "fitness" for each solution.

    2. Cull. Discard the poor solutions.

    3. Breed.

    a. Select 2 "fit" solutions to serve as parents.

    b. From the 2 parents, generate offspring.

    * Crossover: Cut the parents at random and switch the 2 halves.

    * Mutation: Randomly change the value in a parent solution.

    4. Repeat. Go back to Step 1 above.
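The procedure can be sketched on bit strings. The population size, the culling fraction (keep the better half), the 10% mutation rate and the fitness function (number of 1 bits) are all assumptions for illustration; the sketch also applies mutation to the new offspring rather than to a parent:

```python
import random

def genetic_search(fitness, length, population_size=20, generations=40, seed=0):
    rng = random.Random(seed)
    # 0. Initialize: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)]
    for _ in range(generations):
        # 1. Evaluate and 2. cull: keep the fitter half.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        # 3. Breed until the population is back to full size.
        children = []
        while len(survivors) + len(children) < population_size:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)            # crossover point
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.1:                    # occasional mutation
                position = rng.randrange(length)
                child[position] = 1 - child[position]
            children.append(child)
        population = survivors + children             # 4. repeat
    return max(population, key=fitness)

best = genetic_search(fitness=sum, length=12)
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.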

    Genetic Algorithm (Cont.)

  • 7/30/2019 Very Good Minng

    160/301

    Advantages Applicable to a wide range of problem domains.

    Robustness:can obtain solutions even when the performance

    function is highly irregular or input data are noisy.

    Implicit parallelism:can search in many directions concurrently.

    Limitations

    Slow, like neural networks.But: computation can be distributed

    over multiple processors

    (unlike neural networks)

    Source: www.pathology.washington.edu

    Multistrategy Learning

  • 7/30/2019 Very Good Minng

    161/301

    Every technique has advantages & limitations

    Multistrategy approach

    Take advantage of the strengths of diverse techniques

    Circumvent the limitations of each methodology

    Types of Models

  • 7/30/2019 Very Good Minng

    162/301

    Prediction Models for Predicting and Classifying

    Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)

    Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)

    Descriptive Models for Grouping and Finding Associations

    Clustering/Grouping algorithms: K-means, Kohonen

    Association algorithms: apriori, GRI

  • 7/30/2019 Very Good Minng

    163/301

    Neural NetworksDescription

    Difficult interpretation

    Tends to overfit the data

    Extensive amount of training time

    A lot of data preparation

    Works with all data types

    R l I d ti

  • 7/30/2019 Very Good Minng

    164/301

    Rule Induction

    Description

    Intuitive output

    Handles all forms of numeric data,as well as non-numeric (symbolic)data

    C5 Algorithm a special case of ruleinduction

    Apriori

  • 7/30/2019 Very Good Minng

    165/301

    p

    Description Seeks association rules

    in datasetMarket basket analysis

    Sequence discovery

    Data Mining Is

  • 7/30/2019 Very Good Minng

    166/301

    The automated process of findingrelationships and patterns in storeddata

    It is different from the use of SQLqueries and other businessintelligence tools

    Data Mining Is

  • 7/30/2019 Very Good Minng

    167/301

    Motivated by business need, largeamounts of available data, andhumans limited cognitive processing

    abilitiesEnabled by data warehousing,

    parallel processing, and data mining

    algorithms

    Common Types of Information from Data Mining

  • 7/30/2019 Very Good Minng

    168/301

    Associations -- identifies occurrences that are linked to a single event

    Sequences -- identifies events that are linked over time

    Classification -- recognizes patterns that describe the group to which an item belongs

    Common Types of Information from Data Mining

  • 7/30/2019 Very Good Minng

    169/301

    Clustering -- discovers different groupings within the data

    Forecasting -- estimates future values

    Commonly Used Data Mining Techniques

  • 7/30/2019 Very Good Minng

    170/301

    Artificial neural networks

    Decision trees

    Genetic algorithms

    Nearest neighbor method

    Rule induction

    The Current State of Data Mining Tools

  • 7/30/2019 Very Good Minng

    171/301

    Many of the vendors are small companies

    IBM and SAS have been in the market for some time, and more biggies are moving into this market

    BI tools and RDBMS products are increasingly including basic data mining capabilities

    Packaged data mining applications are becoming common

    The Data Mining Process

  • 7/30/2019 Very Good Minng

    172/301

    Requires personnel with domain, data warehousing, and data mining expertise

    Requires data selection, data extraction, data cleansing, and data transformation

    Most data mining tools work with highly granular flat files

    Is an iterative and interactive process

    Why Data Mining

  • 7/30/2019 Very Good Minng

    173/301

    Credit ratings/targeted marketing:

    Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

    Identify likely responders to sales promotions

    Fraud detection:

    Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

    Customer relationship management:

    Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

    Data Mining helps extract such information

    Applications

  • 7/30/2019 Very Good Minng

    174/301

    Banking: loan/credit card approval

    predict good customers based on old customers

    Customer relationship management:

    identify those who are likely to leave for a competitor.

    Targeted marketing:

    identify likely responders to promotions

    Fraud detection: telecommunications, financial transactions

    from an online stream of events, identify fraudulent events

    Manufacturing and production:

    automatically adjust knobs when process parameters change

    Applications (continued)

  • 7/30/2019 Very Good Minng

    175/301

    Medicine: disease outcome, effectiveness of treatments

    analyze patient disease history: find relationship between diseases

    Molecular/Pharmaceutical: identify new drugs

    Scientific data analysis:

    identify new galaxies by searching for subclusters

    Web site/store design and promotion:

    find affinity of visitor to pages and modify

    The KDD process

  • 7/30/2019 Very Good Minng

    176/301

    Problem formulation

    Data collection

    subset data: sampling might hurt if highly skewed data

    feature selection: principal component analysis, heuristic search

    Pre-processing: cleaning

    name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values

    Transformation:

    map complex objects e.g. time series data to features e.g. frequency

    Choosing mining task and mining method:

    Result evaluation and Visualization:

    Knowledge discovery is an iterative process
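
    The pre-processing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, field layout, and the SYNONYMS table are hypothetical, and mean imputation is just one of several ways to supply missing values.

```python
from statistics import mean

# Hypothetical raw records: (name, billing_period, monthly_spend)
raw = [
    ("Ann Lee", "annual", 120.0),
    ("ann lee ", "yearly", 120.0),   # duplicate once name and period are normalized
    ("Bob Roy", "monthly", None),    # missing value
    ("Cy Diaz", "monthly", 80.0),
]

# Different words with the same meaning are mapped to one label (assumed table).
SYNONYMS = {"yearly": "annual"}


def clean(records):
    # Name cleaning and resolution of different meanings.
    normalized = [
        (name.strip().title(), SYNONYMS.get(period, period), spend)
        for name, period, spend in records
    ]
    # Duplicate removal, preserving the original order.
    deduped = list(dict.fromkeys(normalized))
    # Supply missing values with the mean of the observed ones.
    observed = [spend for _, _, spend in deduped if spend is not None]
    fill = mean(observed)
    return [(n, p, fill if s is None else s) for n, p, s in deduped]


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

    In practice each of these choices (how names are matched, which values count as synonyms, how gaps are filled) depends on the domain, which is one reason the process is iterative.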

    Relationship with other fields

  • 7/30/2019 Very Good Minng

    177/301

    Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on

    scalability of number of features and instances

    stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning.

    automation for handling large, heterogeneous data

    Some basic operations

  • 7/30/2019 Very Good Minng

    178/301

    Predictive:

    Regression

    Classification

    Collaborative Filtering

    Descriptive:

    Clustering / similarity matching

    Association rules and variants

    Deviation detection

    Classification

  • 7/30/2019 Very Good Minng

    179/301

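    Given old data about customers and payments, predict a new applicant's loan eligibility.

    [Figure: records of previous customers (age, salary, profession, location, customer type) are fed to a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are then applied to a new applicant's data to label it good or bad.]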
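
    As a rough sketch of how a decision rule of the form "Salary > threshold" could be derived from previous customers, the snippet below learns a single threshold that makes the fewest mistakes on old data. The history values, the salary unit, and the function names are assumptions made for illustration; real classifiers combine many attributes.

```python
def learn_salary_rule(history):
    """Pick the threshold t for the rule 'salary > t -> good'.

    history: list of (salary, label) pairs with label 'good' or 'bad'.
    Returns the candidate threshold with the fewest errors on the history.
    """
    candidates = sorted({salary for salary, _ in history})

    def errors(t):
        # Count records where the rule's prediction disagrees with the label.
        return sum((s > t) != (label == "good") for s, label in history)

    return min(candidates, key=errors)


# Hypothetical previous customers; salary is in an arbitrary unit.
history = [(2.0, "bad"), (3.5, "bad"), (5.0, "bad"), (6.0, "good"), (9.0, "good")]
threshold = learn_salary_rule(history)


def predict(salary):
    # Apply the learned decision rule to a new applicant.
    return "good" if salary > threshold else "bad"


print(threshold, predict(7.0), predict(4.0))
```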

    Classification methods

  • 7/30/2019 Very Good Minng

    180/301

    Goal: Predict class Ci = f(x1, x2, ..., xn)

    Regression: (linear or any other polynomial)

    a*x1 + b*x2 + c = Ci.

    Nearest neighbour

    Decision tree classifier: divide decision space into piecewise constant regions.

    Probabilistic/generative models

    Neural networks: partition by non-

    Nearest neighbor

  • 7/30/2019 Very Good Minng

    181/301

    Define proximity between instances, find neighbors of new instance and assign majority class

    Case based reasoning: when attributes are more complicated than real-valued.

    Cons

    - Slow during application.

    - No feature selection.

    - Notion of proximity vague

    Pros

    + Fast training
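
    A minimal nearest-neighbor sketch makes the trade-off visible: there is no training beyond storing the data, and all the work happens when a new instance is classified. Euclidean distance, k=3, and the toy data are assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    # "Training" is just keeping `train`; the cost is paid here, at application time.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled instances with two numeric attributes each.
train = [
    ((1.0, 1.0), "bad"), ((1.5, 2.0), "bad"), ((2.0, 1.0), "bad"),
    ((6.0, 5.0), "good"), ((7.0, 7.0), "good"), ((6.5, 6.0), "good"),
]

print(knn_predict(train, (6.0, 6.0)))
print(knn_predict(train, (1.2, 1.1)))
```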

    Clustering

  • 7/30/2019 Very Good Minng

    182/301

    Unsupervised learning when old data with class labels not available e.g. when introducing a new product.

    Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.

    Key requirement: Need a good measure of similarity between instances.

    Identify micro-markets and develop

    policies for each

    Applications

  • 7/30/2019 Very Good Minng

    183/301

    Customer segmentation e.g. for targeted marketing

    Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.

    Identify micro-markets and develop policies for each

    Collaborative filtering:

    group based on common items purchased

    Text clustering

    Compression

    Distance functions

  • 7/30/2019 Very Good Minng

    184/301

    Numeric data: Euclidean, Manhattan distances

    Categorical data: 0/1 to indicate presence/absence, followed by

    Hamming distance (# of positions that differ)

    Jaccard coefficient: (# of positions where both are 1) / (# of positions where either is 1)

    data dependent measures: similarity of A and B depends on co-occurrence with C.

    Combined numeric and categorical data:

    weighted normalized distance:
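
    The four named measures are short enough to write out directly. The sketch below assumes equal-length tuples, and 0/1 encoding for the categorical measures; the function names are illustrative.

```python
import math


def euclidean(x, y):
    # Straight-line distance between two numeric vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    # Number of positions at which two 0/1 vectors differ.
    return sum(a != b for a, b in zip(x, y))


def jaccard(x, y):
    # Shared 1s divided by positions where at least one vector has a 1.
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 1.0


print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)), jaccard((1, 0, 1, 1), (1, 1, 0, 1)))
```

    Note that Hamming is a distance (larger means less alike) while the Jaccard coefficient is a similarity (larger means more alike).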

    Clustering methods

  • 7/30/2019 Very Good Minng

    185/301

    Hierarchical clustering

    agglomerative vs. divisive

    single link vs. complete link

    Partitional clustering

    distance-based: K-means

    model-based: EM

    density-based:
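
    The difference between single link and complete link is only in how the distance between two clusters is defined. A small sketch, assuming Euclidean distance between points and hypothetical example clusters:

```python
import math
from itertools import product


def single_link(cluster_a, cluster_b):
    # Distance between the closest pair of points, one from each cluster.
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    # Distance between the farthest pair of points, one from each cluster.
    return max(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (6.0, 0.0)]
print(single_link(a, b), complete_link(a, b))
```

    An agglomerative method would repeatedly merge the two clusters with the smallest such distance; single link tends to produce elongated chains, complete link more compact groups.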

    Partitional methods: K-means

  • 7/30/2019 Very Good Minng

    186/301

    Criteria: minimize sum of square of distance

    Between each point and centroid of the cluster.

    Between each pair of points in the cluster

    Algorithm:

    Select initial partition with K clusters: random, first K, K separated points

    Repeat until stabilization:

    Assign each point to closest cluster center
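
    A compact sketch of the K-means loop follows. It uses the "first K" initialization named above and Euclidean distance; the step that recomputes each center as the mean of its assigned points is the standard K-means update and is an addition here, since the slide text breaks off before it. The data points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    # "First K" initialization: the first k points serve as starting centers.
    centers = list(points[:k])
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # stabilization: no point changed cluster
        assignment = new_assignment
        # Standard update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, 2)
print(centers, labels)
```

    The result depends on the initial partition, which is why the choice among random, first K, and K separated points matters.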

    Collaborative Filtering

  • 7/30/2019 Very Good Minng

    187/301

    Given database of user preferences, predict preference of new user

    Example: predict what new movies you will like based on

    your past preferences

    others with similar past preferences

    their preferences for the new movies

    Example: predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer)
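
    One simple way to turn these three ingredients into a prediction is a similarity-weighted average of other users' ratings. The sketch below is an assumption-laden illustration: the ratings table is made up, and cosine similarity over commonly rated items is one possible choice of similarity, not something the slides prescribe.

```python
import math

# Hypothetical ratings on a 1-5 scale; "New" is the movie "you" have not seen.
ratings = {
    "you":   {"A": 5, "B": 1, "C": 4},
    "user1": {"A": 5, "B": 1, "C": 5, "New": 5},
    "user2": {"A": 1, "B": 5, "C": 2, "New": 1},
}


def similarity(u, v):
    # Cosine similarity restricted to items both users have rated.
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    # Weight each other user's rating of `item` by similarity of past preferences.
    num = den = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        w = similarity(ratings[user], prefs)
        num += w * prefs[item]
        den += w
    return num / den if den else None


predicted = predict("you", "New")
print(round(predicted, 2))
```

    Here user1's tastes match "you" more closely than user2's, so the prediction is pulled toward user1's high rating.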

    Association rules

  • 7/30/2019 Very Good Minng

    188/301

    Given set T of groups of items

    Example: set of item sets purchased

    Goal: find all rules on itemsets of the form a-->b such that

    support of a and b > user threshold s

    conditional probability (confidence) of b given a > user threshold c

    Example: Milk --> bread

    Purchase of product A -->

    [Figure: example set T of purchased item sets: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, and a further set containing cereal]
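
    Support and confidence can be computed directly from a list of transactions. The following sketch enumerates single-item rules a --> b and keeps those above both thresholds; the transactions are hypothetical, and the brute-force enumeration stands in for faster algorithms such as apriori.

```python
from itertools import permutations

# Hypothetical set T of purchased item sets.
transactions = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"milk", "cereal", "bread"},
]


def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(a, b):
    # Estimated conditional probability of b given a.
    return support(a | b) / support(a)


def rules(min_support, min_confidence):
    # Brute-force search over single-item rules a --> b.
    items = sorted(set().union(*transactions))
    found = []
    for a, b in permutations(items, 2):
        if support({a, b}) > min_support and confidence({a}, {b}) > min_confidence:
            found.append((a, b))
    return found


print(rules(min_support=0.4, min_confidence=0.6))
```

    With these toy transactions only milk and cereal co-occur often enough to pass the support threshold, and the rule holds in both directions with different confidence.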

    Prevalent ≠ Interesting

  • 7/30/2019 Very Good Minng

    189/301

    Analysts already know about prevalent rules

    Interesting rules are those that deviate from prior expectation

    Mining's payoff is in finding surprising phenomena

    [Figure: in 1995, "Milk and cereal sell together!" is a discovery; by 1998, the same rule gets only "Zzzz... Milk and cereal sell together!"]

    Applications of fast itemset counting

  • 7/30/2019 Very Good Minng

    190/301

    Find correlated events:

    Applications in medicine: find redundant tests

    Cross selling in retail, banking

    Improve predictive capability of classifiers that assume attribute independence

    New similarity measures of categorical attributes [Mannila et al,

    Application Areas

  • 7/30/2019 Very Good Minng

    191/301

    Industry                  Application
    Finance                   Credit Card Analysis
    Insurance                 Claims, Fraud Analysis
    Telecommunication         Call record analysis
    Transport                 Logistics management
    Consumer goods            promotion analysis
    Data Service providers    Value added data
    Utilities                 Power usage analysis

    Usage scenarios

  • 7/30/2019 Very Good Minng

    192/301

    Data warehouse mining:

    assimilate data from operational sources

    mine static data

    Mining log data

    Continuous mining: example in process control

    Stages in mining:

    data selection -> pre-processing: cleaning -> transformation -> mining -> result evaluation -> visualization

    Mining market

  • 7/30/2019 Very Good Minng

    193/301

    Around 20 to 30 mining tool vendors

    Major tool players:

    Clementine,

    IBM's Intelligent Miner,

    SGI's MineSet,

    SAS's Enterprise Miner.

    All pretty much the same set of tools

    Many embedded products:

    fraud detection:

    electronic commerce applications,

    health care,

    customer relationship management: Epiphany

    Vertical integration:

    Mining on the web

  • 7/30/2019 Very Good Minng

    194/301

    Web log analysis for site design:

    what are popular pages,

    what links are hard to find.

    Electronic stores sales enhancements:

    recommendations, advertisement:

    Collaborative filtering: Net perception, Wisewire

    Inventory control: what was a shopper looking for and could not find.

    State of art in mining OLAP integration

  • 7/30/2019 Very Good Minng

    195/301

    Decision trees [Information discovery, Cognos]

    find factors influencing high profits

    Clustering [Pilot software]

    segment customers to define hierarchy on that dimension

    Time series analysis: [Seagate's Holos]

    Query for various shapes along time: e.g. spikes, outliers

    Multi-level Associations [Han et al.]

    find associations between members of dimensions

    Data Mining in Use

  • 7/30/2019 Very Good Minng

    196/301

    The US Government uses Data Mining to track fraud

    A Supermarket becomes an information broker

    Basketball teams use it to track game strategy

    Cross Selling

    Target Marketing

    Holding on to Good Customers

    Weeding out Bad Customers

    Some success stories

  • 7/30/2019 Very Good Minng

    197/301

    Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data

    Won over (manual) knowledge engineering approach

    http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process

    Major US bank: customer attrition prediction

    First segment customers based on financial behavior: found 3 segments

    Build attrition models for each of the 3 segments

    40-50% of attritions were predicted == factor of 18 increase

    Targeted credit marketing: major US banks

    find customer segments based on 13 months credit balances

    What is KnowledgeSeeker?

  • 7/30/2019 Very Good Minng

    198/301

    Produced by ANGOSS Software Corporation, who focus solely on data mining software.

    Offer training and consulting services

    Produce data mining add-ins which accept data from all major databases

    Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools.

    Major Competitors

  • 7/30/2019 Very Good Minng

    199/301

    Company Software

    Clementine 6.0

    Enterprise Miner 3.0

    Intelligent Miner

    Major Competitors

  • 7/30/2019 Very Good Minng

    200/301

    Company Software

    Mineset 3.1

    Darwin

    Scenario

    Current Applications

  • 7/30/2019 Very Good Minng

    201/301

    Manufacturing

    Used by the R.R. Donnelley & Sons commercial printing company to improve process control, cut costs and increase productivity.

    Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems.

    Current Applications

  • 7/30/2019 Very Good Minng

    202/301

    Auditing

    Used by the IRS to combat fraud, reduce risk, and increase collection rates.

    Finance

    Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management.

    Current Applications

  • 7/30/2019 Very Good Minng

    203/301

    CRM

    Telephony

    Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology.

    Current Applications

  • 7/30/2019 Very Good Minng

    204/301

    Marketing

    Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis.

    Health Care

    Used by the Oxford Transplant Center to discover factors affecting transplant survival rates.

    Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea.

    More Customers

  • 7/30/2019 Very Good Minng

    205/301

    Questions

  • 7/30/2019 Very Good Minng

    206/301

    1. What percentage of people in the test group have high blood pressure with these characteristics: a 66-year-old male regular smoker who has low to moderate salt consumption?

    2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?

    3. If you are a 2% milk drinker, how many factors are still interesting?

    4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?

    5. Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.

    Association

  • 7/30/2019 Very Good Minng

    207/301

    Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction.

    This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products.

    Example: 80 percent of all transactions in which beer was purchased also included potato chips.

    Sequence-based analysis

  • 7/30/2019 Very Good Minng

    208/301

    Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction.

    Sequence-based analysis instead follows purchases over time, to identify a typical set of purchases that might predict the subsequent purchase of a specific item.

    Clustering

  • 7/30/2019 Very Good Minng

    209/301

    Clustering approaches address segmentation problems.

    These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."

    Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.

    Classification

  • 7/30/2019 Very Good Minng

    210/301

    Most commonly applied data mining technique

    Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.

    Example: A classifier derived from the Classification approach that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.

    Issues of Data Mining

  • 7/30/2019 Very Good Minng

    211/301

    Present-day tools are strong but require significant expertise to implement effectively.

    Susceptibility to "dirty" or irrelevant data.

    Inability to "explain" results in human terms.

    Issues

  • 7/30/2019 Very Good Minng

    212/301

    Susceptibility to "dirty" or irrelevant data

    Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions.

    Users must take the necessary precautions to ensure that the data being analyzed is "clean."

    Issues, cont

  • 7/30/2019 Very Good Minng

    213/301

    Inability to "explain" results in human terms

    Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.

    What good does the information do if you don't understand it?

    Comparison with reporting, BI and OLAP

  • 7/30/2019 Very Good Minng

    214/301

    Reporting

    Simple relationships

    Choose the relevant factors

    Examine all details

    (Also applies to visualisation & simple statistics)

    Data Mining

    Complex relationships

    Automatically find the relevant factors

    Show only relevant details

    Prediction

    Comparison with Statistics

  • 7/30/2019 Very Good Minng

    215/301

    Statistical analysis

    Mainly about hypothesis testing

    Focussed on precision

    Data mining

    Mainly about hypothesis generation

    Focussed on deployment

    Example: data mining and customer processes

  • 7/30/2019 Very Good Minng

    216/301

    Insight: Who are my customers and why do they behave the way they do?

    Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?

    Uses: Targeted marketing, mail-shots, call-centres, adaptive web-sites

    Example: data mining and fraud detection

  • 7/30/2019 Very Good Minng

    217/301

    Insight: How can (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?

    Prediction: Recognise similarity to previous frauds: how similar? Spot abnormal events: how suspicious?

    Example: data mining and diagnosing cancer

  • 7/30/2019 Very Good Minng

    218/301

    Complex data from genetics

    Challenging data mining problem

    Find patterns of gene activation indicating different diseases / stages

    "Changed the way I think about cancer" (Oncologist from Chicago Children's Memorial Hospital)

    Example: data mining and policing

  • 7/30/2019 Very Good Minng

    219/301

    Knowing the patterns helps plan effective crime prevention

    Crime hot-spots understood better

    Sift through mountains of crime reports

    Identify crime series

    "Other people save money using data mining; we save lives." (Police force homicide specialist and data miner)

    Data mining tools: Clementine and its philosophy

  • 7/30/2019 Very Good Minng

    220/301

    How to do data mining

  • 7/30/2019 Very Good Minng

    221/301

    Lots of data mining operations

    How do you glue them together to solve a problem?

    How do we actually do data mining?

    Methodology

    Not just the right way, but any way

    Myths about Data Mining (1): Data, Process and Tech

  • 7/30/2019 Very Good Minng

    222/301

    Data mining is all about massive data

    It can be, but some important datasets are very small, and sampling is often appropriate

    Data mining is a technical process

    Business analysts perform data mining every day

    It is a business process

    Data mining is all about algorithms

    Algorithms are a key tool

    But data mining is done by people, not by algorithms

    Data mining is all about predictive accuracy

    It's about usefulness

    Accuracy is only a small component

    Myths about Data Mining (2): Data Quality

  • 7/30/2019 Very Good Minng

    223/301

    Data mining only works with clean data

    Cleaning the data is part of the data mining process

    Need not be clean initially

    Data mining only works with complete data

    Data mining works with whatever data you have. Complete is good, incomplete is also ok.

    Data mining only works with correct