data mining and data visualization som 485 fall 2007


Post on 20-Dec-2015


Data Mining and Data Visualization

SOM 485

Fall 2007

Getting Started

What is Data Mining?
Online Analytical Processing
Data Mining Techniques
Market Basket Analysis
Limitations and Challenges to Data Mining
Data Visualization
Siftware Technologies

What is Data Mining (DM)?

Group of activities used to find different patterns in data
Information provided through a Data Warehouse
Provides valuable information for different types of research

Applications of DM

Customer Relationship Management (CRM) software is an application that can benefit from DM

Activities of CRM
One-to-One Marketing
Sales Force Automation
Sales Campaign Management
Marketing Encyclopedia
Call Center Automation

Verification of DM

Requires a lot of prior knowledge on the decision maker’s part

Used mainly in casinos
e.g. can determine whether a new customer is a high roller, a souvenir buyer, a ticket purchaser, etc.

Uses Siftware to help discover new patterns of customer spending habits
Allows effective targeting of a specific group of customers

Online Analytical Processing

Online Analytical Processing (OLAP) was introduced by E. F. Codd in 1993

OLAP: computer process that allows a user to extract data from different viewpoints

Scientific and academic organizations store about 1 terabyte (1 trillion bytes) of new data each day.

OLAP continued…

Codd’s 12 Rules for OLAP
1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation

OLAP: MOLAP & ROLAP

OLAP data is stored in a Multidimensional Database (MDB)

MOLAP: OLAP application that accesses data from a multidimensional database

MDBs are frequently created using input from an existing Relational Database

ROLAP: OLAP application that runs on a Relational Database server and can work with SQL for portability and scalability.
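The idea of extracting data from different viewpoints can be sketched as a roll-up over the dimensions of a small fact table. This is only an illustration in plain Python, not how any particular MOLAP or ROLAP product works; the rows, the dimension names, and the `roll_up` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales). Invented for illustration.
DIMENSIONS = ("region", "product", "quarter")
FACTS = [
    ("East", "Pasta", "Q1", 120),
    ("East", "Wine", "Q1", 80),
    ("West", "Pasta", "Q1", 100),
    ("West", "Wine", "Q2", 60),
    ("East", "Pasta", "Q2", 90),
]

def roll_up(facts, keep):
    """Sum sales over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[p] for p in positions)
        totals[key] += row[3]  # the last column is the sales measure
    return dict(totals)

# Two views of the same data: one by region, one by product and quarter.
by_region = roll_up(FACTS, ["region"])
by_product_quarter = roll_up(FACTS, ["product", "quarter"])
```

Each call keeps a different subset of dimensions, which is the "different viewpoints" idea in miniature. A ROLAP system would express the same aggregation as an SQL `GROUP BY` over relational tables.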

DATA MINING TECHNIQUES

FOUR MAJOR CATEGORIES

1. Classification

2. Association

3. Sequence

4. Cluster

CLASSIFICATION

- Mining processes intended to discover rules that define whether an item belongs to a particular class of data

- Two Sub-processes:

1) Building a Model

2) Predicting Classifications
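The two sub-processes can be illustrated with a very small classifier. The sketch below uses a nearest-centroid method, chosen here only because it is short; the slides do not name a specific algorithm. The training data (average bet, visits per month) and the class labels are hypothetical.

```python
def build_model(examples):
    """Step 1: learn one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Step 2: assign the class whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: distance(model[label]))

# Hypothetical training data: (average bet in dollars, visits per month) -> customer class
TRAINING = [
    ((500.0, 8.0), "high roller"),
    ((450.0, 6.0), "high roller"),
    ((20.0, 1.0), "souvenir buyer"),
    ((35.0, 2.0), "souvenir buyer"),
]

model = build_model(TRAINING)
label = predict(model, (400.0, 5.0))
```

With these made-up numbers a new customer at (400, 5) lies far closer to the "high roller" centroid, so that is the predicted class.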

ASSOCIATION

Techniques that employ association search all details from operational systems for patterns with a high probability of repetition

Example: Market Basket Analysis

SEQUENCE

Time series analysis methods relate events in time based on a series of preceding events

Through analysis, various hidden trends, often highly predictive of future events, can be discovered.

Example: Mail Industry

CLUSTER

To create partitions so that all members of each set are similar according to some metric

Simply a set of objects grouped together by virtue of their similarity or proximity to each other

Example: Credit Card Transactions
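Partitioning by similarity can be sketched with a one-dimensional k-means, where the metric is the absolute difference between a value and a cluster center. k-means is used here as an assumed, common clustering method; the slides do not specify one. The transaction amounts and starting centers are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, then move each centroid
    to the mean of its members, repeating a fixed number of times."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical credit card transaction amounts in dollars.
AMOUNTS = [12.0, 15.0, 9.0, 480.0, 510.0, 495.0]
clusters, centers = k_means_1d(AMOUNTS, centroids=[0.0, 100.0])
```

With these values the small everyday purchases and the large purchases end up in separate clusters. Real transaction data would normally use several attributes and a distance defined over all of them.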

DATA MINING TECHNOLOGIES

Providing new answers to old questions
Developing new knowledge and understanding through discovery

Statistical Analysis – statistically evaluating products and making a decision based on logical reasoning

Neural Networks – attempts to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn

DATA MINING TECHNOLOGIES CONT’D

Genetic Algorithms and Fuzzy Logic – machine learning techniques that derive meaning from complicated and imprecise data and can extract patterns from and detect trends within the data that are far too complex to be noticed by humans

Decision Trees – assist in data mining applications by classifying the items or events contained within the warehouse

NEW APPLICATIONS FOR DATA MINING

Two new categories of applications

1) Text Mining – summarizes, navigates, and clusters documents contained in a database

2) Web Mining – integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer

Market Basket Analysis

• Market Basket Analysis is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together.

• It takes its name from the idea of a person in a supermarket throwing all of their items into a shopping cart (a "market basket").

• Market basket analysis is one of the most common and useful types of data analysis for marketing.

• With the data gathered from MBA, marketers can identify products that customers like and group them together.

• Market basket analysis can improve the effectiveness of marketing and sales tactics.

Benefits of Market Basket Analysis:

• A good indication of consumer behavior

• Increase in sales

• Improves customer satisfaction

• Tracks which types of products interest consumers and finds related alternatives to introduce to them.

ASSOCIATION RULES for MBA

• Support

• Confidence

• Lift

• Method

Association rules are a common undirected data mining technique and complement market basket analysis.

These rules are unidirectional

Left-hand side IMPLIES Right-hand side

ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold

40% of transactions that contain Pasta also contain Wine. 4% of transactions contain both of these items.

Support: percentage of baskets in which the association rule holds, i.e. baskets that contain both the Left-hand side and the Right-hand side.

ex. 4% of transactions contain both

Confidence: probability that the Right-hand side item is present given that the Left-hand side item is present.

ex. 40% of transactions that contain Pasta… p = .40

Lift: compares the confidence of the rule with the likelihood of finding the Right-hand side item in any random basket. It measures how well an association rule performs relative to how well the item sells without the other item (improvement).
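The three measures can be computed directly from a list of transactions. The sketch below builds 100 hypothetical baskets that reproduce the Pasta and Wine figures from the slides (4% support, 40% confidence). The slides do not say how often Wine is bought overall, so the 20% base rate used here is an assumption made only so that lift can be calculated. Lift is taken as confidence divided by the overall frequency of the right-hand side, which is the usual definition.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs in t)
    n_rhs = sum(1 for t in transactions if rhs in t)
    n_both = sum(1 for t in transactions if lhs in t and rhs in t)
    support = n_both / n            # share of all baskets containing both items
    confidence = n_both / n_lhs     # share of lhs baskets that also contain rhs
    lift = confidence / (n_rhs / n) # confidence relative to rhs's base rate
    return support, confidence, lift

# Hypothetical baskets: 10 contain Pasta, 4 of those also contain Wine,
# and 20 contain Wine in total (the 20 is an assumption, not from the slides).
BASKETS = (
    [{"Pasta", "Wine"}] * 4
    + [{"Pasta"}] * 6
    + [{"Wine"}] * 16
    + [{"Bread"}] * 74
)

pasta_to_wine = rule_metrics(BASKETS, "Pasta", "Wine")
wine_to_pasta = rule_metrics(BASKETS, "Wine", "Pasta")
```

Both rules have the same support, but the confidence of Wine IMPLIES Pasta is only half that of Pasta IMPLIES Wine under these assumed counts, which illustrates why the rules are unidirectional. A lift above 1 suggests the left-hand item raises the chance of the right-hand item appearing in the basket.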

Method

               Frozen Pizza   Milk   Cola   Potato Chips   Pretzels
Frozen Pizza        2           1      2         0             0
Milk                1           3      1         1             1
Cola                2           1      3         0             1
Potato Chips        0           1      0         1             0
Pretzels            0           1      1         0             2
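A co-occurrence table like the one above can be built by counting, for every pair of items, the transactions that contain both. The five transactions in this sketch are not given in the slides; they are an assumed set chosen because they are consistent with the counts in the table.

```python
ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# Assumed transactions, chosen to be consistent with the table; not from the slides.
TRANSACTIONS = [
    {"Frozen Pizza", "Milk", "Cola"},
    {"Milk", "Potato Chips"},
    {"Frozen Pizza", "Cola"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]

def co_occurrence(transactions, items):
    """Count, for each pair of items, the transactions that contain both.
    The diagonal holds the number of transactions containing the item at all."""
    return [
        [sum(1 for t in transactions if a in t and b in t) for b in items]
        for a in items
    ]

table = co_occurrence(TRANSACTIONS, ITEMS)
for name, row in zip(ITEMS, table):
    print(f"{name:<13}", *row)
```

The matrix is symmetric, and the largest off-diagonal entry (Frozen Pizza with Cola) points to the pair most often bought together in this assumed data.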

Market Basket Analysis

Market basket analysis determines what products customers purchase together

Limits to Market Basket Analysis

• A large amount of data is required to obtain meaningful results, but accuracy is compromised if the products do not all occur with similar frequency.

• ex. Milk sells in almost every transaction, but Elmer’s glue sells sporadically, so it is not effective to put them in the same basket analysis.

• Sometimes presents results that are actually due to the success of previous marketing campaigns.

• ex. Discounted price of cola with purchase of pizza.

Using Data from MBA

Once information has been gathered about different items and how they sell with respect to other items, a store may want to change its layout of items to improve its profits.

ex. Lunchboxes and School Supplies

Businesses without an actual storefront may want to offer promotions for products that sell together, increasing sales.

MARKET BASKET ANALYSIS In a Nutshell

Current Limitations and Challenges to Data Mining

New and underdeveloped field

Identification of missing information
Most companies run legacy systems
Not DW (data warehouse) friendly
DW designers have to convert existing ODSs (operational data stores) to the homogeneous form of the DW

Current Limitations & Challenges to Current Limitations & Challenges to Data MiningData Mining

Not all knowledge about application Not all knowledge about application domains are present in the datadomains are present in the data

ODSs are normally limited to those ODSs are normally limited to those needed by the operational application needed by the operational application associated with that DBassociated with that DB

Data warehouse designers need to include Data warehouse designers need to include mechanisms for “inventorying” datamechanisms for “inventorying” data

Data noise & missing values

Most operational databases contain data errors in their values and/or classification
Errors lead to misclassification

Future data mining systems must incorporate more sophisticated mechanisms for treating “noisy data”
Bayesian technique – a statistical technique

Large Databases & high dimensionality

Databases are large & dynamic
Contents are always changing

Data patterns must be constantly updated

New discovery applications have to partition problems into smaller chunks of manageable data without losing any essential attributes of the data

Data Visualization

Process by which numerical data are converted into meaningful 3-D images
Example

Intended to analyze complex data

Data from: satellite photos, sonar measurements, surveys, or computer simulations

History of Data Visualization

Originated from statistics and science
Example of 2-D

Advancement credited to NCSA
National Center for Supercomputing Applications

Newest developments by Xerox PARC in virtual reality

Human Visual Perception

Human visual cortex dominates our perception

Accelerates the identification of hidden patterns in data
“A picture is worth a thousand words”

Geographical Information Systems (GIS)

A special-purpose DB in which a common spatial coordinate system is the primary means of reference

Requires:
1. Data input
2. Data storage, retrieval, and query
3. Data transformation, analysis, and modeling
4. Data reporting

Integrates information and aids in decision making

GIS continued

Spatial Data – elements stored in map form

• Contain three basic components:
1. Points
2. Lines
3. Polygons

Attribute Data – describes spatial data
Example of GIS
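The split between spatial data and attribute data can be sketched as a small record type: coordinates in a common coordinate system describe the shape, and a dictionary of attributes describes what the shape represents. The `Feature` class, the example features, and the `polygon_area` helper are hypothetical and do not correspond to any particular GIS product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (x, y) in one shared coordinate system

@dataclass
class Feature:
    kind: str                      # "point", "line", or "polygon"
    coordinates: List[Coordinate]  # spatial data
    attributes: Dict[str, str] = field(default_factory=dict)  # attribute data

def polygon_area(coords: List[Coordinate]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(coords, coords[1:] + coords[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical features: a store (point), a road (line), a sales district (polygon).
store = Feature("point", [(2.0, 3.0)], {"name": "Store 12"})
road = Feature("line", [(0.0, 0.0), (2.0, 3.0)], {"name": "Main St"})
district = Feature(
    "polygon",
    [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)],
    {"name": "Sales district A"},
)

district_area = polygon_area(district.coordinates)
```

The area calculation stands in for the "transformation, analysis, and modeling" step: once shapes share a coordinate system, simple geometry can be combined with the attribute data for reporting.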

Applications of Data Visualization Techniques

Retail Banking
Government
Insurance
Health Care and Medicine
Telecommunications
Transportation
Capital Markets
Asset Management

Siftware Technologies

IBM
Informix
Red Brick
DB2
Oracle
Silicon Graphics
Sybase

Offers several Data Mining solutions, depending on the user’s needs.

IBM Information Warehouse Solutions

IBM Visualizer

Red Brick

Informix

Three-tier model
Tier 1: “Client” presentation layer

Tier 2: Hewlett-Packard hardware

Tier 3: Data layer, INFORMIX-OnLine database

Sybase Warehouse WORKS
Assemble data from many sources

Transform data for a consistent and understandable view

Distribute data where needed

Provide high-speed access to the data

Leading company for large-scale data mining

Data spread across multiple databases

Data spread across processors for faster queries

Discover new patterns and trends that may not be realized using traditional SQL

Three-dimensional Visualization

Visual models can save days and even months from the review process

Review

Data mining (DM)

Techniques used to mine data

Market Basket Analysis: The King of DM Algorithms

Review continued…

Current Limitations and Challenges to Data Mining

Data Visualization

Siftware Technologies