DAT204
Introduction to Data Mining with SQL Server 2000
ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation
Agenda
• What is Data Mining
• The Data Mining Market
• OLE DB for Data Mining
• Overview of the Data Mining Features in SQL Server 2000
• Demo
• Q&A
What Is Data Mining?
What is DM?
• A process of data exploration and analysis using automatic or semi-automatic means
  – Techniques originate from machine learning, statistics and databases
  – “Exploring data”: scanning samples of known facts about “cases”
  – “Knowledge”: clusters, rules, decision trees, equations, association rules…
• Once the “knowledge” is extracted it:
  – Can be browsed
    • Provides very useful insight into the behavior of the cases
  – Can be used to predict values of other cases
    • Can serve as a key element in closed-loop analysis
What drives high school students to attend college?
The deciding factors for high school students to attend college are…
[Figure: decision tree for Attend College]
All Students: 55% Yes, 45% No
Split on IQ:
  IQ = High: 79% Yes, 21% No
  IQ = Low: 45% Yes, 55% No
Split on Wealth:
  Wealth = True: 94% Yes, 6% No
  Wealth = False: 69% Yes, 21% No
Split on ParentsEncourage (branches ParentsEncourage = Yes and ParentsEncourage = No):
  70% Yes, 30% No
  31% Yes, 69% No
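Each node in a tree like this reports the class distribution of the cases that reach it. The following Python sketch shows how such a distribution could be computed for one candidate split. The function name `split_distribution` and the six sample cases are illustrative assumptions, not the data or the algorithm behind the slide.

```python
from collections import Counter, defaultdict


def split_distribution(cases, attribute, target):
    """Return {attribute value: {target value: fraction}} for one split."""
    counts = defaultdict(Counter)
    for case in cases:
        # Tally the target outcome under the branch this case falls into.
        counts[case[attribute]][case[target]] += 1
    return {
        value: {label: n / sum(tally.values()) for label, n in tally.items()}
        for value, tally in counts.items()
    }


# Made-up cases for illustration only.
cases = [
    {"IQ": "High", "AttendCollege": "Yes"},
    {"IQ": "High", "AttendCollege": "Yes"},
    {"IQ": "High", "AttendCollege": "Yes"},
    {"IQ": "High", "AttendCollege": "No"},
    {"IQ": "Low", "AttendCollege": "Yes"},
    {"IQ": "Low", "AttendCollege": "No"},
]

dist = split_distribution(cases, "IQ", "AttendCollege")
for value, fractions in dist.items():
    print(value, fractions)
```

A real decision tree algorithm would additionally score every candidate attribute and recurse on the best one; this sketch covers only the per-node statistics.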
Business Oriented DM Problems
• Targeted ads
  – “What banner should I display to this visitor?”
• Cross sells
  – “What other products is this customer likely to buy?”
• Fraud detection
  – “Is this insurance claim a fraud?”
• Churn analysis
  – “Who are those customers likely to churn?”
• Risk Management
  – “Should I approve the loan to this customer?”
• …
Mining Process - Illustrated
[Figure: Training Data feeds the DM Engine, which produces a Mining Model; the Mining Model and the Data To Predict feed the DM Engine, which produces Predicted Data]
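The two phases in the diagram can be sketched in a few lines of Python. This is a deliberately tiny stand-in for a DM engine, assuming a single input column and a majority-vote model; the function names `train` and `predict` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def train(training_data, input_column, target_column):
    """Training phase: turn cases with known outcomes into a mining model."""
    counts = defaultdict(Counter)
    for case in training_data:
        counts[case[input_column]][case[target_column]] += 1
    # The model keeps only the learned pattern, not the training rows.
    return {value: tally.most_common(1)[0][0] for value, tally in counts.items()}


def predict(mining_model, data_to_predict, input_column):
    """Prediction phase: apply the model to cases whose outcome is unknown."""
    return [
        {**case, "Predicted": mining_model.get(case[input_column])}
        for case in data_to_predict
    ]


# Made-up training cases for illustration only.
training_data = [
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "Yes"},
    {"Encouragement": "Encouraged", "CollegePlans": "No"},
    {"Encouragement": "Not Encouraged", "CollegePlans": "No"},
    {"Encouragement": "Not Encouraged", "CollegePlans": "No"},
]

mining_model = train(training_data, "Encouragement", "CollegePlans")
predicted_data = predict(
    mining_model,
    [{"StudentID": 7, "Encouragement": "Encouraged"}],
    "Encouragement",
)
print(mining_model)
print(predicted_data)
```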
The Data Mining Market
The $$$: Market Size
• DM Tools Market:
  – 1999: $341.3M
  – 2000: $455.1M
  – 2001: $449.5M
* IDC
The Players
• Leading vendors
  – SAS
  – SPSS
  – IBM
  – Angoss
  – Hundreds of smaller vendors offering DM algorithms…
• Oracle – Thinking Machines acquisition
The Products
• End-to-end horizontal DM tools
  – Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analyst's workbench, Reporting, Charting…
• The customer is the power analyst
  – A PhD in statistics is usually required…
• Closed tools – no standard API
  – Total vendor lock-in
  – Limited integration with applications
• DM is an “outsider” in the Data Warehouse
• Extensive consulting required
• Skyrocketing prices
  – $60K+ for a single-user license
What the analysts say…
• “Stand-alone Data Mining Is Dead” - Forrester
• “The demise of [stand alone] data mining” – Gartner
The Microsoft Approach
DataPro Users Survey 1999-2001
“Data mining will be the fastest-growing BI technology…”
Market Size of BI
* IDC
SQL Server 2000 - The Analysis Platform
• SQL 2000 provides a complete Analysis Platform
  – Not an isolated, stand-alone DM product
• Platform means:
  – Standards-based DM APIs (OLE DB for DM) for application development
  – Integrated vision for all technologies and tools
  – Extensible
  – Scalable
Data Flow
[Figure: data flow diagram with nodes OLTP, DW, OLAP, DM, Apps, and Reports & Analysis]
Analysis Services 2000 – Components
[Figure: component diagram. Labels: Manager UI, DSO, Analysis Server, Client, OLE DB OLAP, DM, OLAP Engine, OLAP Engine (local), DM Engine, DM Engine (local), DMM, DM Wizards, DM DTS Task, Tree View Control, Cluster View Control, Lift Chart Control, Sample Query Tool]
OLE DB for Data Mining…
Why OLE DB for DM?
Make DM a mass market technology by:
• Leveraging existing technologies and knowledge
  – SQL and OLE DB
• Common industry-wide concepts and data presentation
• Changing DM market perception from “proprietary” to “open”
• Increasing the number of players:
  – Reduce the cost and risk of becoming a consumer: one tool works with multiple providers
  – Reduce the cost and risk of becoming a provider: focus on expertise and find many partners to complement the offering
Integration With RDBMS
• Customers would like to
  – Build DM models from within their RDBMS
  – Train the models directly off their relational tables
  – Perform predictions as relational queries (tables in, tables out)
  – Feel that DM is a native part of their database
• Therefore…
  – Data mining models are relational objects
  – All operations on the models are relational
  – The language used is SQL (with extensions)
• The effect: every DBA and VB developer can become a DM developer
Creating a Data Mining Model (DMM)
Identifying the “Cases”
• DM algorithms analyze “cases”
• The “case” is the entity being categorized and classified
• Examples
  – Customer credit risk analysis: Case = Customer
  – Product profitability analysis: Case = Product
  – Promotion success analysis: Case = Promotion
• Each case encapsulates all we know about the entity
A Simple Set of Cases
StudentID  Gender  Parent Income  IQ   Encouragement   College Plans
1          Male    23400          120  Not Encouraged  No
2          Female  79200          90   Encouraged      Yes
3          Male    42000          105  Not Encouraged  Yes
More Complicated Cases
Cust ID  Age  Marital Status  IQ  Favorite Movies (Title, Score)
1        35   M               2   Star Wars, 8
                                  Toy Story, 9
                                  Terminator, 7
2        20   S               3   Star Wars, 7
                                  Braveheart, 7
                                  The Matrix, 10
3        57   M               2   Sixth Sense, 9
                                  Casablanca, 10
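A case with a nested table can be pictured as one parent record that owns a variable-length list of child rows. The Python sketch below models the customers above that way and shows what flattening into plain relational rows looks like. The class and function names are hypothetical, and only a subset of the columns is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class MovieRating:
    title: str
    score: int


@dataclass
class CustomerCase:
    cust_id: int
    age: int
    marital_status: str
    # The nested table: zero or more favorite movies per customer.
    favorite_movies: list = field(default_factory=list)


def flatten(case):
    """Repeat the case-level columns once per nested row."""
    return [
        (case.cust_id, case.age, case.marital_status, movie.title, movie.score)
        for movie in case.favorite_movies
    ]


cases = [
    CustomerCase(1, 35, "M", [MovieRating("Star Wars", 8),
                              MovieRating("Toy Story", 9),
                              MovieRating("Terminator", 7)]),
    CustomerCase(2, 20, "S", [MovieRating("Star Wars", 7),
                              MovieRating("Braveheart", 7),
                              MovieRating("The Matrix", 10)]),
    CustomerCase(3, 57, "M", [MovieRating("Sixth Sense", 9),
                              MovieRating("Casablanca", 10)]),
]

rows = [row for case in cases for row in flatten(case)]
for row in rows:
    print(row)
```

Keeping the movies nested preserves the fact that there are three cases, not eight; the flattened form loses that unless the consumer regroups by customer.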
A DMM is a Table!
• A DMM structure is defined as a table
  – Training a DMM means inserting data (patterns) into the table
  – Predicting from a DMM means querying the table
• All information describing the case is contained in columns
Creating a Mining Model
CREATE MINING MODEL [Plans Prediction]
(
    StudentID LONG KEY,
    Gender TEXT DISCRETE,
    ParentIncome LONG CONTINUOUS,
    IQ DOUBLE CONTINUOUS,
    Encouragement TEXT DISCRETE,
    CollegePlans TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
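Because the model definition is plain text, an application can assemble it from a column list before sending it to a provider. The Python sketch below only builds the statement string for the model above. The helper `create_mining_model_sql` is a hypothetical name, and executing the statement against a provider is out of scope here.

```python
def create_mining_model_sql(model_name, columns, algorithm):
    """Assemble a CREATE MINING MODEL statement from (name, type, flags) tuples."""
    column_lines = ",\n".join(
        f"    [{name}] {data_type} {flags}" for name, data_type, flags in columns
    )
    return (
        f"CREATE MINING MODEL [{model_name}]\n"
        f"(\n{column_lines}\n)\n"
        f"USING {algorithm}"
    )


# Column definitions copied from the slide.
columns = [
    ("StudentID", "LONG", "KEY"),
    ("Gender", "TEXT", "DISCRETE"),
    ("ParentIncome", "LONG", "CONTINUOUS"),
    ("IQ", "DOUBLE", "CONTINUOUS"),
    ("Encouragement", "TEXT", "DISCRETE"),
    ("CollegePlans", "TEXT", "DISCRETE PREDICT"),
]

statement = create_mining_model_sql(
    "Plans Prediction", columns, "Microsoft_Decision_Trees"
)
print(statement)
```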
Creating a mining model with nested table
CREATE MINING MODEL MoviePrediction
(
    CustomerId LONG KEY,
    Age LONG CONTINUOUS,
    Gender TEXT DISCRETE,
    Education TEXT DISCRETE,
    MovieList TABLE PREDICT (
        MovieName TEXT KEY
    )
)
USING Microsoft_Decision_Trees
Training a DMM
Training a DMM
• Training a DMM means passing it data for which the attributes to be predicted are known
  – Multiple passes are handled internally by the provider!
• Use an INSERT INTO statement
• The DMM will not persist the inserted data
• Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules)
INSERT [INTO] <mining model name>
[(columns list)]
<source data query>
INSERT INTO
INSERT INTO [Plans Prediction]
(StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans)
SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans]
FROM [Students]
When Insert Into Is Done…
• The DMM is trained
  – The model can be retrained
  – Content (rules, trees, formulas) can be explored
  – OLE DB Schema rowset
  – SELECT * FROM <dmm>.CONTENT
  – XML string (PMML)
• Prediction queries can be executed
Predictions
What are Predictions?
• Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values
• Predictions = queries
  – The syntax is SQL-like
  – The output is a rowset
• In order to predict you need:
  – Input data set
  – A trained DMM
  – Binding (mapping) information between the input data and the DMM
The Truth Table Concept
Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
…
Prediction
Gender  Parent Income  IQ  Encouragement   College Plans  Probability
Male    20000          85  Not Encouraged  No             85%
Male    20000          85  Not Encouraged  Yes            15%
Male    20000          85  Encouraged      No             60%
Male    20000          85  Encouraged      Yes            40%
Male    20000          90  Not Encouraged  No             80%
Male    20000          90  Not Encouraged  Yes            20%
Male    20000          90  Encouraged      No             58%
Male    20000          90  Encouraged      Yes            42%
Male    20000          95  Not Encouraged  No             78%
Male    20000          95  Not Encouraged  Yes            22%
Male    20000          95  Encouraged      No             45%
It’s a JOIN!
StudentID  Gender  Parent Income  IQ   Encouragement
1          Male    43000          85   Not Encouraged
2          Male    20000          135  Not Encouraged
3          Female  25000          105  Encouraged
4          Male    96000          100  Encouraged
5          Female  56000          125  Not Encouraged
6          Female  46000          90   Not Encouraged
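The truth table idea can be made concrete with a lookup: treat the trained model as a table keyed by the input attributes, then join each new case against it. The Python sketch below uses four key combinations from the truth table slide. The function `prediction_join`, the student IDs, and the choice to return the most probable outcome are illustrative assumptions; a real provider evaluates the model rather than materializing such a table.

```python
# (Gender, ParentIncome, IQ, Encouragement) -> {CollegePlans: probability}
truth_table = {
    ("Male", 20000, 85, "Not Encouraged"): {"No": 0.85, "Yes": 0.15},
    ("Male", 20000, 85, "Encouraged"): {"No": 0.60, "Yes": 0.40},
    ("Male", 20000, 90, "Not Encouraged"): {"No": 0.80, "Yes": 0.20},
    ("Male", 20000, 90, "Encouraged"): {"No": 0.58, "Yes": 0.42},
}


def prediction_join(truth_table, input_cases):
    """Join each input case to the truth table and keep the likeliest outcome."""
    results = []
    for case in input_cases:
        key = (case["Gender"], case["ParentIncome"],
               case["IQ"], case["Encouragement"])
        outcomes = truth_table.get(key)
        if outcomes is None:
            # No matching row: this toy table cannot predict the case.
            results.append((case["StudentID"], None, None))
            continue
        plans, probability = max(outcomes.items(), key=lambda item: item[1])
        results.append((case["StudentID"], plans, probability))
    return results


# Hypothetical new students chosen to match rows of the toy table.
new_students = [
    {"StudentID": 101, "Gender": "Male", "ParentIncome": 20000,
     "IQ": 85, "Encouragement": "Not Encouraged"},
    {"StudentID": 102, "Gender": "Male", "ParentIncome": 20000,
     "IQ": 90, "Encouragement": "Encouraged"},
]

predictions = prediction_join(truth_table, new_students)
for student_id, plans, probability in predictions:
    print(student_id, plans, probability)
```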
The Prediction Query Syntax
SELECT <columns to return or predict>
FROM
    <dmm> PREDICTION JOIN
    <input data set>
ON <dmm column> = <dmm input column>…
Example
SELECT [New Students].[StudentID],
       [Plans Prediction].[CollegePlans],
       PredictProbability([CollegePlans])
FROM
    [Plans Prediction] PREDICTION JOIN
    [New Students]
ON [Plans Prediction].[Gender] = [New Students].[Gender] AND
   [Plans Prediction].[IQ] = [New Students].[IQ] AND ...
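Demo
OLE DB for Data Mining Defines API
[Figure: layered diagram. Consumers (Consumer, Consumer, …) sit on top of OLE DB for DM (API), which sits on top of providers (Provider, Provider, Provider, …); the providers reach the data sources (RDBMS, Cube, Misc. Data Source) through OLE DB]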
OLEDB for DM Configuration Options Demo
[Figure: Consumers (MS Analysis Manager, ANGOSS Controls) connect through OLEDB for DM to Providers (MS DM Provider, ANGOSS DM Provider); the consumer-provider pairings are numbered 1 to 4]
Demo on OLE DB for DM API using Angoss Controls and Provider
For more info…
• DM URL
  – www.microsoft.com/data/oledb
  – www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
  – Microsoft.public.SQLserver.datamining
  – Communities.msn.com/AnalysisServicesDataMining
• White papers:
  – Performance paper: www.unisys.com/windows2000/default-07.asp
  – www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
Questions?