more value from data using data mining allan mitchell sql server mvp
TRANSCRIPT
Who am I
• SQL Server MVP• SQL Server Consultant• Joint author on Wrox Professional SSIS book• Worked with SQL Server since version 6.5• www.SQLDTS.com and www.SQLIS.com• Partner of SQL Know How
Today’s Schedule
• what is data mining (Overview)• data mining terminology• myths around data mining• excel AddIn to Office2007
– Demo Setup– Demo Key Influencers– Demo Categories– Demo Make a Prediction– Demo “Other stuff” – if time
• Questions and answers
What is Data Mining
• The process of using statistical techniques to discover subtle relationships between data items, and the construction of predictive models based on them. The process is not the same as just using an OLAP tool to find exceptional items. Generally, data mining is a very different and more specialist application than OLAP, and uses different tools from different vendors. Normally the users are different, too. OLAP vendors have had little success with their data mining efforts.
OLAP REPORT
What does Data Mining Do?
Explores Your Data
Finds Patterns
Performs Predictions
Query, Reporting, Analysis Data Mining
What Why
How
Comparative BenefitsPredictive Projects versus Nonpredictive Projects
0%
10%
20%
30%
40%
50%
60%
70%
80%
Technology Productivity Business ProcessEnhancement
Predictive Nonpredictive
Source: IDC, 2003
Data Mining terminology
• mining structure• mining model• mining algorithm• training dataset• testing dataset
SQL Server 2005 Algorithms
Decision Trees Clustering Time Series
Sequence Clustering
Association Naïve Bayes
Neural NetPlus: Linear and Logistic Regression
Sequence Clustering
• Applied to– Click stream analysis– Customer segmentation with
sequence data– Sequence prediction
• Mix of clustering and sequence technologies
• Group individuals based on their profiles including sequence data
Time Series
• Applied to– Forecast sales– Web hits prediction– Stock value estimation
• Patented technique from Microsoft Research
• Uses regression tree technology to describe and predict series values
Clustering• Applied to
– Segmentation: Customer grouping, Mailing campaign
– Also support classification and regression
• Expectation Maximization– Probabilistic Clustering
• K-Means– Distance based
• Clusters both discrete and continuous values– Discrete values are “binarized”
• Anomaly detection• Check variable independence
– “Predict Only” attributes not used for clustering
dm data flow
Cube
HistoricalDataset
NewDataset
Data Transform (SSIS)Reporting
Mining Models
ModelBrowsing
Prediction
LOBApplication
Cube
DMX CREATE MINING MODEL CreditRisk
(CustID LONG KEY,
Gender TEXT DISCRETE,
Income LONG CONTINUOUS,
Profession TEXT DISCRETE,
Risk TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees
CREATE MINING MODEL CreditRisk
(CustID LONG KEY,
Gender TEXT DISCRETE,
Income LONG CONTINUOUS,
Profession TEXT DISCRETE,
Risk TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees
INSERT INTO CreditRisk
(CustId, Gender, Income, Profession, Risk)
Select
CustomerID, Gender, Income, Profession,Risk
From Customers
INSERT INTO CreditRisk
(CustId, Gender, Income, Profession, Risk)
Select
CustomerID, Gender, Income, Profession,Risk
From Customers
Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk)
FROM CreditRisk PREDICTION JOIN NewCustomers
ON CreditRisk.Gender=NewCustomer.Gender
AND CreditRisk.Income=NewCustomer.Income
AND CreditRisk.Profession=NewCustomer.Profession
Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk)
FROM CreditRisk PREDICTION JOIN NewCustomers
ON CreditRisk.Gender=NewCustomer.Gender
AND CreditRisk.Income=NewCustomer.Income
AND CreditRisk.Profession=NewCustomer.Profession
Myths around data mining
• You have to be a propeller head
• It’s a new concept.• Only works with SSAS cubes
Excel 2007 DMAddin
• DM visualisation• table analysis• Create session models/permanent models• Connect to ssas for full blown models• intuitive interface