data mining and data visualization som 485 fall 2007
Post on 20-Dec-2015
Getting Started
• What is Data Mining?
• Online Analytical Processing
• Data Mining Techniques
• Market Basket Analysis
• Limitations and Challenges to Data Mining
• Data Visualization
• Siftware Technologies
What is Data Mining (DM)?
• Group of activities used to find different patterns in data
• Information provided through a Data Warehouse
• Provides valuable information for different types of research.
Applications of DM
• Customer Relationship Management (CRM) software is an application that can benefit from DM
• Activities of CRM
- One-to-One Marketing
- Sales Force Automation
- Sales Campaign Management
- Marketing Encyclopedia
- Call Center Automation
Verification of DM
• Requires a lot of prior knowledge on the decision maker’s part
• Used mainly in casinos
- e.g., can determine if a new customer is a high roller, a souvenir buyer, a ticket purchaser, etc.
• Uses Siftware to help discover new patterns of customer spending habits
- Allows effective targeting of a specific group of customers
Online Analytical Processing
• Online Analytical Processing (OLAP) was introduced by E. F. Codd in 1993
• OLAP: computer process that allows a user to extract data from different viewpoints
• Scientific and academic organizations store about 1 terabyte (1 trillion bytes) of new data each day.
OLAP continued
Codd’s 12 Rules for OLAP
1. Multidimensional View
2. Transparent to the User
3. Accessible
4. Consistent Reporting
5. Client-Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-user Support
9. Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Infinite Levels of Dimension and Aggregation
OLAP: MOLAP & ROLAP
• OLAP data is stored in a Multidimensional Database (MDB)
• MOLAP: OLAP application that accesses data from a multidimensional database
• MDBs are frequently created using input from an existing Relational Database
• ROLAP: OLAP on a Relational Database server that can work with SQL for portability and scalability.
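To make "extracting data from different viewpoints" concrete, here is a minimal sketch that rolls up one small fact table along different dimensions. The rows, the dimension names, and the roll_up helper are all hypothetical and invented for illustration; a real OLAP server does this against an MDB or through SQL.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales). Invented data.
FACTS = [
    ("East", "Pasta", "Q1", 100),
    ("East", "Wine", "Q1", 150),
    ("West", "Pasta", "Q1", 80),
    ("West", "Wine", "Q2", 120),
    ("East", "Pasta", "Q2", 90),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, *dims):
    """Sum sales grouped by the chosen dimensions (one 'viewpoint')."""
    positions = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]  # column 3 holds the sales measure
    return dict(totals)


# The same data seen from two viewpoints.
by_region = roll_up(FACTS, "region")
by_product_quarter = roll_up(FACTS, "product", "quarter")
print(by_region)
print(by_product_quarter)
```

Each call picks a different set of dimensions to keep and aggregates the rest away, which is the idea behind the multidimensional view in Codd's first rule.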
FOUR MAJOR CATEGORIES
1. Classification
2. Association
3. Sequence
4. Cluster
CLASSIFICATION
- Mining processes intended to discover rules that define whether an item belongs to a particular class of data
- Two Sub-processes:
1) Building a Model
2) Predicting Classifications
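The two sub-processes can be sketched with a very small rule learner. This is only an illustration, assuming a single numeric attribute and exactly two classes; the training data (average bet per customer) and the function names build_model and predict are hypothetical, not taken from the slides.

```python
def build_model(examples):
    """Sub-process 1: learn one rule 'value >= threshold' from (value, label) pairs."""
    labels = sorted({label for _, label in examples})  # assumes exactly two labels
    values = sorted({value for value, _ in examples})
    best = None  # (errors, threshold, label_if_high, label_if_low)
    for threshold in values:
        for high, low in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum(
                1 for value, label in examples
                if (high if value >= threshold else low) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, high, low)
    _, threshold, high, low = best
    return {"threshold": threshold, "high": high, "low": low}


def predict(model, value):
    """Sub-process 2: apply the learned rule to a new item."""
    return model["high"] if value >= model["threshold"] else model["low"]


# Hypothetical training data: (average bet in dollars, customer class).
training = [(5, "casual"), (10, "casual"), (20, "casual"),
            (200, "high roller"), (500, "high roller")]
model = build_model(training)
print(model)
print(predict(model, 350))
print(predict(model, 8))
```

The learned rule is the "model"; classifying a new customer is then just a lookup against that rule.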
ASSOCIATION
• Techniques that employ association search all details from operational systems for patterns with a high probability of repetition
• Example: Market Basket Analysis
SEQUENCE
• Time series analysis methods relate events in time based on a series of preceding events
• Through analysis, various hidden trends, often highly predictive of future events, can be discovered.
• Example: Mail Industry
CLUSTER
• To create partitions so that all members of each set are similar according to some metric
• Simply a set of objects grouped together by virtue of their similarity or proximity to each other
• Example: Credit Card Transactions
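As a rough illustration of partitioning by a similarity metric, the sketch below groups transaction amounts with a one-dimensional k-means loop. The slides do not name a clustering algorithm, so k-means, the absolute-difference metric, the starting centers, and the amounts are all assumptions made for this example.

```python
def k_means_1d(values, centers, iterations=10):
    """Partition numbers into clusters around the nearest center."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Metric: absolute difference between the value and each center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical credit card transaction amounts in dollars.
amounts = [12, 15, 18, 22, 480, 510, 495]
centers, clusters = k_means_1d(amounts, centers=[0, 100])
print(centers)
print(clusters)
```

With these numbers the small everyday purchases and the large purchases end up in separate partitions, which is the kind of grouping a card issuer might inspect further.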
DATA MINING TECHNOLOGIES
• Providing new answers to old questions
• Developing new knowledge and understanding through discovery
• Statistical Analysis: statistically evaluating products and making a decision based on logical reasoning
• Neural Networks: attempt to mirror the way the human brain works in recognizing patterns by developing mathematical structures with the ability to learn
DATA MINING TECHNOLOGIES CONT’
• Genetic Algorithms and Fuzzy Logic: machine learning techniques that derive meaning from complicated and imprecise data and can extract patterns from, and detect trends within, data that are far too complex to be noticed by humans
• Decision Trees: assist in data mining applications by classifying items or events contained within the warehouse
NEW APPLICATIONS FOR DATA MINING
Two new categories of applications
1) Text Mining: summarizes, navigates, and clusters documents contained in a database
2) Web Mining: integrates data and text mining within a Web site; enhances the Web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer
Market Basket Analysis
• Market Basket Analysis is an algorithm that examines a long list of transactions in order to determine which items are most frequently purchased together.
• It takes its name from the idea of a person in a supermarket throwing all of their items into a shopping cart (a "market basket").
• Market basket analysis is one of the most common and useful types of data analysis for marketing.
• With the data gathered from MBA, marketers can identify products that customers tend to buy together and group them.
• Market basket analysis can improve the effectiveness of marketing and sales tactics.
Benefits of Market Basket Analysis:
• A good indication of consumer behavior
• Increase in sales
• Improves customer satisfaction
• Tracks what types of products interest consumers and finds related alternatives to introduce to them.
ASSOCIATION RULES for MBA
• Support
• Confidence
• Lift
• Method
Association rules are a common undirected data mining technique and complement market basket analysis.
These rules are unidirectional:
Left-hand side IMPLIES Right-hand side
ex. Pasta IMPLIES Wine, but Wine IMPLIES Pasta may not hold
40% of transactions that contain Pasta also contain Wine. 4% of transactions contain both of these items.
Support: the percentage of baskets in which the association rule holds, i.e., baskets that contain both the left-hand side and the right-hand side items.
ex. 4% of transactions contain both
Confidence: the probability that the right-hand side item is present given that the left-hand side item is present.
ex. 40% of transactions that contain Pasta also contain Wine, so p = .40
Lift: compares the rule's confidence with the likelihood of finding the right-hand side item in any random basket. It measures how well an association rule performs by comparing it with how well the right-hand side item sells without the other item (improvement).
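The three measures can be computed directly from a list of baskets. The sketch below builds 100 hypothetical baskets that match the Pasta and Wine figures above (4% contain both, 40% of Pasta baskets contain Wine); the overall share of baskets containing Wine (20%) is not given in the slides and is an assumption needed to compute lift. The function name rule_metrics is also hypothetical.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs IMPLIES rhs."""
    n = len(transactions)
    lhs_count = sum(1 for t in transactions if lhs in t)
    rhs_count = sum(1 for t in transactions if rhs in t)
    both_count = sum(1 for t in transactions if lhs in t and rhs in t)
    support = both_count / n               # share of baskets with both items
    confidence = both_count / lhs_count    # P(rhs present | lhs present)
    lift = confidence / (rhs_count / n)    # confidence vs. baseline P(rhs)
    return support, confidence, lift


# 100 hypothetical baskets: 4 with both, 6 Pasta only, 16 Wine only, 74 neither.
baskets = (
    [{"Pasta", "Wine"}] * 4
    + [{"Pasta"}] * 6
    + [{"Wine"}] * 16      # assumed so that 20% of all baskets contain Wine
    + [{"Bread"}] * 74
)

support, confidence, lift = rule_metrics(baskets, "Pasta", "Wine")
print(support, confidence, lift)

# The reverse rule has the same support but a different confidence,
# which is why the rules are unidirectional.
rev_support, rev_confidence, rev_lift = rule_metrics(baskets, "Wine", "Pasta")
print(rev_support, rev_confidence, rev_lift)
```

A lift above 1 means the right-hand side item shows up more often in baskets with the left-hand side item than it does in baskets overall; a lift near 1 means the rule adds little.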
Method
              Frozen Pizza  Milk  Cola  Potato Chips  Pretzels
Frozen Pizza       2          1     2        0           0
Milk               1          3     1        1           1
Cola               2          1     3        0           1
Potato Chips       0          1     0        1           0
Pretzels           0          1     1        0           2
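A co-occurrence table like the one above can be built by counting, for every pair of items, the baskets that contain both. The slides do not list the underlying baskets, so the five transactions below are an assumption, chosen only because they reproduce the counts in the table; the function name co_occurrence is hypothetical.

```python
ITEMS = ["Frozen Pizza", "Milk", "Cola", "Potato Chips", "Pretzels"]

# Assumed baskets (not from the slides) that are consistent with the table.
TRANSACTIONS = [
    {"Frozen Pizza", "Cola", "Milk"},
    {"Milk", "Potato Chips"},
    {"Cola", "Frozen Pizza"},
    {"Milk", "Pretzels"},
    {"Cola", "Pretzels"},
]


def co_occurrence(transactions, items):
    """Count baskets containing both the row item and the column item.

    The diagonal holds the number of baskets containing each single item.
    """
    return [
        [sum(1 for t in transactions if a in t and b in t) for b in items]
        for a in items
    ]


matrix = co_occurrence(TRANSACTIONS, ITEMS)
for item, row in zip(ITEMS, matrix):
    print(f"{item:<13}", row)
```

The matrix is symmetric because "A with B" and "B with A" count the same baskets; the direction of a rule only appears once confidence is computed from these counts.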
Market Basket Analysis
Market Basket Analysis: determines what products customers purchase together
Limits to Market Basket Analysis
• A large amount of data is required to obtain meaningful results, but accuracy is compromised if the products do not all occur with similar frequency.
• ex. Milk sells in almost every transaction, but Elmer’s glue sells sporadically, so it is not effective to put them in the same basket analysis.
• Sometimes presents results that are actually due to the success of previous marketing campaigns.
• ex. Discounted price of cola with purchase of pizza.
Using Data from MBA
• Once information has been gathered about different items and how they sell with respect to other items, a store may want to change its layout of items to improve its profits.
• ex. Lunchboxes and School Supplies
• A business without an actual storefront may want to offer promotions for products that sell together, increasing sales.
Current Limitations & Challenges to Data Mining
• New and underdeveloped field
• Identification of missing information
• Most companies run legacy systems
• Not DW (data warehouse) friendly
• DW designers have to convert existing ODSs (operational data stores) to the homogeneous form of the DW
Current Limitations & Challenges to Data Mining
• Not all knowledge about the application domain is present in the data
• The contents of an ODS are normally limited to the data needed by the operational application associated with that DB
• Data warehouse designers need to include mechanisms for “inventorying” data
Data noise & missing values
• Most operational databases contain data errors in their values and/or classification
• Errors lead to misclassification
• Future data mining systems must incorporate more sophisticated mechanisms for treating “noisy data”
• Bayesian technique: a statistical technique
Large Databases & high dimensionality
• Databases are large & dynamic
• Contents are always changing
• Data patterns must be constantly updated
• New discovery applications have to partition problems into smaller chunks of manageable data without losing any essential attributes of the data
Data Visualization
• Process by which numerical data are converted into meaningful 3-D images
• Example
• Intended to analyze complex data
• Data from: satellite photos, sonar measurements, surveys, or computer simulations
History of Data Visualization
• Originated from statistics and science
• Example of 2-D
• Advancement credited to NCSA
• National Center for Supercomputing Applications
• Newest developments by Xerox PARC in virtual reality
Human Visual Perception
• Human visual cortex dominates our perception
• Accelerates the identification of hidden patterns in data
• “A picture is worth a thousand words”
Geographical Information Systems (GIS)
• A special-purpose DB in which a common spatial coordinate system is the primary means of reference
• Requires:
1. Data input
2. Data storage, retrieval, and query
3. Data transformation, analysis, and modeling
4. Data reporting
• Integrates information and aids in decision making
GIS continued
• Spatial Data: elements stored in map form
• Contain three basic components:
1. Points
2. Lines
3. Polygons
• Attribute Data: describes spatial data
• Example of GIS
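One way to picture the split between spatial data and attribute data is a record that carries coordinates plus a dictionary of descriptive values. The sketch below is only an illustration: the Feature class, the coordinates, and the attribute values are hypothetical, and real GIS products use their own storage formats.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    kind: str      # spatial component type: "point", "line", or "polygon"
    coords: list   # spatial data: (x, y) pairs in a common coordinate system
    attributes: dict = field(default_factory=dict)  # attribute data


def polygon_area(coords):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(coords, coords[1:] + coords[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# Hypothetical features sharing one coordinate system.
store = Feature("point", [(2, 3)], {"name": "Store 12"})
road = Feature("line", [(0, 0), (2, 3), (5, 3)], {"name": "Main St"})
district = Feature("polygon", [(0, 0), (4, 0), (4, 3), (0, 3)],
                   {"population": 1200})

area = polygon_area(district.coords)
print(area)
print(district.attributes["population"] / area)  # people per unit of area
```

Because every feature uses the same coordinate system, spatial results such as the polygon's area can be combined with attribute data such as population in a single query.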
Applications of Data Visualization Techniques
• Retail Banking
• Government
• Insurance
• Health Care and Medicine
• Telecommunications
• Transportation
• Capital Markets
• Asset Management
Siftware Technologies
• IBM
• Informix
• Red Brick
• DB2
• Oracle
• Silicon Graphics
• Sybase
• Offers several Data Mining solutions, depending on the user's needs.
• IBM Information Warehouse Solutions
• IBM Visualizer
Red Brick
Informix
• Three-tier model
• Tier 1: “Client” presentation layer
• Tier 2: Hewlett-Packard hardware
• Tier 3: Data layer, INFORMIX-OnLine database
Sybase Warehouse WORKS
• Assemble data from many sources
• Transform data for a consistent and understandable view
• Distribute data where needed
• Provide high-speed access to the data
• Leading company for large-scale data mining
• Data spread across multiple databases
• Data spread across processors for faster queries
• Discover new patterns and trends that may not be realized using traditional SQL
• Three-dimensional Visualization
• Visual models can save days and even months from the review process
Review
• Data mining (DM)
• Techniques used to mine data
• Market Basket Analysis: The King of DM Algorithms