Data Mining
David Klein & Adam Cogan
Admin Stuff
• Attendance
  – You initial sheet
• Hands On Lab
  – You get me to initial sheet
• Certificate
  – At end of 10 sessions
  – If I say you have completed successfully
About
• David Klein is a Senior Software Architect at SSW, specialising in .NET, SQL Server and BI solutions
  – Current clients: Sally Knox Medical & Pisces
• Adam Cogan is Chief Architect at SSW and one of 2 Microsoft Regional Directors in Australia, specialising in Office, SQL and .NET solutions
Course Overview
The 5 Sessions (Part B)
1. SSIS and Creating a Data Warehouse
2. Creating a Cube and Cube Issues
3. Reporting Services
4. Other Cube Browsers
5. Data Mining
http://www.ssw.com.au/ssw/events/2006SQL/
Session 5: Tonight’s Agenda
1. Why Data Mining?
2. Uses
3. Algorithms
4. Implementation
  – SQL Server Management Studio (SSMS)
  – Reporting Services
  – SQL Server Integration Services
  – DMX
5. Demo?
6. Hands on Lab
Why Data Mining?
• Marketing
  – Who picks the movie? The kids, the wife, me.
  – Who are our Customers and what sort of films do they hire?
  – Is a 30 year old woman with 2 children going to hire Arnie’s latest film?
• Validation
  – Is this data sensible? Terminator 2 and Toy Story
• Prediction
  – Sales Next Year
Complete Set Of Algorithms
• Decision Trees
• Clustering
• Time Series
• Sequence Clustering
• Association
• Naïve Bayes
• Neural Nets
Naïve Bayes
• Quickly builds mining models that can be used for classification and prediction
• It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute
  – This can later be used to predict an outcome of the predicted attribute based on the known input attributes
• This makes the model a good option for exploring the data
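As a rough illustration of the probability counting described above, here is a minimal Python sketch of a Naïve Bayes classifier. It is not the SQL Server implementation: the rental records, the attribute names and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical rental records: (age_band, has_children) -> genre hired.
cases = [
    (("under_30", "no"), "action"),
    (("under_30", "no"), "action"),
    (("under_30", "yes"), "family"),
    (("30_plus", "yes"), "family"),
    (("30_plus", "yes"), "family"),
    (("30_plus", "no"), "action"),
]

def train(cases):
    """Count how often each input state occurs for each predictable state."""
    class_counts = Counter(label for _, label in cases)
    cond_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for inputs, label in cases:
        for i, value in enumerate(inputs):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts

def predict(model, inputs):
    """Return a normalised probability for each state of the predictable attribute."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(label)
        for i, value in enumerate(inputs):
            # P(value | label) with add-one smoothing; the "+ 2" assumes
            # every attribute in this toy data set has exactly two states.
            score *= (cond_counts[(i, label)][value] + 1) / (n + 2)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

model = train(cases)
probs = predict(model, ("30_plus", "yes"))
best_genre = max(probs, key=probs.get)
```

For a customer aged 30 or over with children, the sketch scores "family" well above "action", because every state count in the toy data points that way.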
Naïve Bayes – Toy Story 2
Decision Trees (1)
• Decision Trees assign (classify) each case to one of a few discrete, broad categories of a selected attribute (variable) and explain the classification with a few selected input variables
• The process of building is recursive partitioning – splitting data into partitions and then splitting each partition further
• Initially all cases are in one big box
Decision Trees (2)
• The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the searched variable
  – Several measures of purity
• Then it repeats splitting for each new class
  – Again testing all possible breaks
• Unhelpful branches of the tree can be pre-pruned or post-pruned
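The recursive partitioning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Gini impurity is used as the purity measure (one of several possible), splits are simple "equals this value" tests, a depth limit stands in for pre-pruning, and the customer data is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means the partition holds a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (attribute, value) test and keep the lowest weighted impurity."""
    best = None
    n = len(rows)
    for attr in range(len(rows[0])):
        for value in sorted({row[attr] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            right = [lab for row, lab in zip(rows, labels) if row[attr] != value]
            if not left or not right:
                continue  # this test does not actually partition the data
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, attr, value)
    return best

def build(rows, labels, depth=0, max_depth=3):
    """Split recursively; max_depth acts as a simple form of pre-pruning."""
    split = None
    if depth < max_depth and gini(labels) > 0.0:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, attr, value = split
    yes = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
    no = [(r, l) for r, l in zip(rows, labels) if r[attr] != value]
    return {
        "attr": attr,
        "value": value,
        "yes": build([r for r, _ in yes], [l for _, l in yes], depth + 1, max_depth),
        "no": build([r for r, _ in no], [l for _, l in no], depth + 1, max_depth),
    }

def classify(tree, row):
    """Walk from the root to a leaf and return its class."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["attr"]] == tree["value"] else tree["no"]
    return tree

# Hypothetical customers: (age_band, has_children) and the genre they hire.
rows = [("under_30", "no"), ("under_30", "no"), ("under_30", "yes"),
        ("30_plus", "yes"), ("30_plus", "yes"), ("30_plus", "no")]
labels = ["action", "action", "family", "family", "family", "action"]

tree = build(rows, labels)
prediction = classify(tree, ("30_plus", "yes"))
```

On this toy data the first split lands on the has_children attribute, since it separates the two genres perfectly, and the tree stops there because both partitions are already pure.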
Decision Trees (3)
• Decision trees are used for classification and prediction
• Typical questions:
  – Predict which customers will leave
  – Help in mailing and promotion campaigns
  – Explain reasons for a decision
  – What are the movies young female customers like to buy?
Decision Trees – Who Decides
Cluster Analysis (1)
• Grouping data into clusters
  – Objects within a cluster have high similarity based on the attribute values
• The class label of each object is not known
• Several techniques
  – Partitioning methods
  – Hierarchical methods
  – Density based methods
  – Model based methods
  – And more…
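As one example of a partitioning method, here is a minimal k-means sketch in Python. It is an illustration only: the customer points (age, rentals per month) are hypothetical, and seeding the centroids with the first k points is a simplifying assumption rather than a recommended initialisation.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning and re-centring."""
    centroids = list(points[:k])  # naive seeding: assumes the first k points differ
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, rentals per month). No class labels are supplied.
points = [(22, 8), (45, 2), (25, 9), (24, 7), (47, 1), (50, 3)]
centroids, clusters = kmeans(points, k=2)
```

With these points the sketch separates the younger, frequent renters from the older, occasional ones without being told which group any customer belongs to.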
Cluster Analysis (2)
• Segments a heterogeneous population into a number of more homogeneous subgroups or clusters
• Some typical questions:
  – Discover distinct groups of customers
  – Identification of groups of houses in a city
  – In biology, derive animal and plant taxonomies
  – Find outliers
Conclusion: When To Use What
Analytical problem | Examples | Algorithms
Classification: Assign cases to predefined classes | Credit risk analysis, Churn analysis, Customer retention | Decision Trees, Naïve Bayes, Neural Nets
Segmentation: Taxonomy for grouping similar cases | Customer profile analysis, Mailing campaign | Clustering, Sequence Clustering
Association: Advanced counting for correlations | Market basket analysis, Advanced data exploration | Decision Trees, Association
Time Series Forecasting: Predict the future | Forecast sales, Predict stock prices | Time Series
Prediction: Predict a value for a new case based on values for similar cases | Quote insurance rates, Predict customer income | All
Deviation analysis: Discover how a case or segment differs from others | Credit card fraud detection, Network intrusion analysis | All
Summary
• Why Data Mining?
• Uses
• Algorithms
• Implementation
  – SQL Server Management Studio (SSMS)
  – Reporting Services
  – SQL Server Integration Services
  – DMX
• Demo?
• Hands on Lab
Book
Data Mining with SQL Server 2005
ZhaoHui Tang and Jamie MacLennan
Wiley Press
BI is Cool