Issues in Data Mining Applications -Tutorial-
How to Make a Decision About Your Own Data Mining Tool?
Authors:
Valentina Milenkovic, [email protected]
Prof. Dr. Veljko Milutinovic, [email protected]
This material was developed with financial help of the WUSA fund of Austria.
Page Number: 2
Data Mining vs. Knowledge Mining = ?
Page Number: 3
Evolution Of Data Mining
Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
Data Collection (1960s) | What was my average total revenue over the last 5 years? | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | What were unit sales in New England last March? | RDBMS, SQL, ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Navigation (1990s) | What were unit sales in New England last March? Drill down to Boston. | OLAP, multidimensional databases, data warehouses | Pilot, IRI, Arbor, Redbrick, Evolutionary Technologies | Retrospective, dynamic data delivery at multiple levels
Data Mining (2000) | What's likely to happen to Boston unit sales next month? Why? | Advanced algorithms, multiprocessors, massive databases | Lockheed, IBM, SGI, numerous startups | Prospective, proactive information delivery
Page Number: 4
Examples of DM projects to stimulate your imagination
Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment.
– Targeting a set of consumers who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going to produce oil
Page Number: 5
Comparison of fourteen DM tools
Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
Each uses one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
Solve four problems: two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
Prices range from $75 to $25,000
Page Number: 6
Comparison of fourteen DM tools
The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell2
- KnowledgeMiner
Page Number: 7
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:
Capability measures what a desktop tool can do, and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation
Page Number: 8
Visualisation
+ excellent capability   good capability   - some capability   "blank" no capability
Page Number: 9
Criteria for evaluating DM tools
Learnability/Usability shows how easy a tool is to learn and use
- Tutorials
- Wizards
- Easy to learn
- User's manual
- Online help
- Interface
Page Number: 10
Criteria for evaluating DM tools
Interoperability shows a tool's ability to interface with other computer applications
- Importing data
- Exporting data
- Links to other applications
Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code
Page Number: 11
Data Input & Output Model
+ excellent capability   good capability   - some capability   "blank" no capability
Page Number: 12
A classification of data sets
Pima Indians Diabetes data set
– 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance
Wisconsin Breast Cancer data set
– 699 instances of breast tumors, some of which are malignant, most of which are benign
– 10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance
Moon Cannon data set
– 300 solutions to the equation: x = 2 v^2 sin(θ) cos(θ) / g
– the data were generated without adding noise
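A noise-free estimation set like Moon Cannon can be regenerated directly from its formula. The sketch below is only an illustration: it reads the equation as the standard projectile range formula, with v the muzzle velocity, θ the firing angle and g the gravitational acceleration. The value used for g and the sampling ranges for v and θ are assumptions, since the slide does not state them.

```python
import math
import random

MOON_GRAVITY = 1.62  # m/s^2; assumed value for g, not given on the slide


def cannon_range(v, theta, g=MOON_GRAVITY):
    """Range x = 2 v^2 sin(theta) cos(theta) / g for a firing angle theta in radians."""
    return 2.0 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_moon_cannon(n=300, seed=0):
    """Generate n noise-free (v, theta, x) rows; the sampling ranges are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # muzzle velocity in m/s (assumed range)
        theta = rng.uniform(0.0, math.pi / 2)  # firing angle in radians (assumed range)
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


rows = make_moon_cannon()
print(len(rows), rows[0])
```

Because no noise is added, a tool that can represent the formula should be able to fit such data almost exactly, which is what makes it a useful estimation benchmark.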
Page Number: 13
Evaluation of fourteen DM tools
Page Number: 14
Strengths and Weaknesses
Strengths
– Ease of use (Scenario, WizWhy...)
– Data visualisation (S-Plus, MineSet...)
– Depth of algorithms (tree options) (CART, See5, S-Plus...)
– Multiple neural network architectures (NeuroShell)
Weaknesses
– Difficult file I/O (OLPARS, CART)
– Limited visualisation (PRW, See5, WizWhy)
– Narrow analysis path (Scenario)
Page Number: 15
How to improve existing DM applications
The top ten points:
Database integration
– no more flat files
– use the millions of dollars spent on data warehousing
Automated model scoring
– without scoring, DM is pretty useless
– should be integrated with the driving applications
Exporting models to other applications
– close the loop between DM and applications that need to use the results (scores)
Page Number: 16
How to improve existing DM applications
Business templates
– a specific cross-selling application is more valuable than a general modeling tool
Effort knob
– it is relevant in a way that tuning parameters are not
Incorporate financial information
– the financial information is very important and often available, and should be provided as input to the DM application
Page Number: 17
How to improve existing DM applications
Computed target columns
– allow the user to interactively create a new target variable
Time-series data
– a year's worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables
Use versus View
– do not present the full model visually to the user, only the most important levels
Wizards
– not necessary, but desirable
– prevent human error by keeping the user on track
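The time-series point can be made concrete with a small sketch. It assumes a hypothetical list of twelve monthly balances per customer and derives order-dependent features (trend slope, largest monthly drop) that twelve unordered columns could not express. The function and feature names are illustrative only.

```python
def trend_features(balances):
    """Summarise an ordered series of monthly balances as level, slope and largest drop."""
    n = len(balances)
    mean = sum(balances) / n
    # Least-squares slope of balance against month index 0..n-1.
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (b - mean) for t, b in enumerate(balances))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den
    # Largest month-over-month drop (0 if the balance never falls).
    max_drop = max([balances[i] - balances[i + 1] for i in range(n - 1)] + [0])
    return {"mean": mean, "slope": slope, "max_drop": max_drop}


# Two hypothetical customers holding the same twelve values in opposite order.
growing = [100 + 10 * month for month in range(12)]
shrinking = list(reversed(growing))

print(trend_features(growing))
print(trend_features(shrinking))
```

Both customers have the same set of twelve values and the same mean, yet the slopes have opposite signs, which is the information lost when the months are treated as unrelated variables.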
Page Number: 18
Potential Applications
Data mining has many varied fields of application, some of which are listed below.
• Retail/Marketing
Identify buying patterns from customers
Find associations among customer demographic characteristics
Predict response to mailing campaigns
Market basket analysis
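Market basket analysis is commonly expressed through two measures, support and confidence. The sketch below computes both on a handful of made-up baskets; the item names and the data are purely hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule such as "bread implies milk" is usually kept only when both measures exceed thresholds chosen by the analyst.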
Page Number: 19
Potential Applications
• Banking
Detect patterns of fraudulent credit card use
Identify 'loyal' customers
Determine credit card spending by customer groups
Find hidden correlations between different financial indicators
Identify stock trading rules from historical market data
Page Number: 20
Potential Applications
• Insurance and Health Care
Claims analysis, i.e., which medical procedures are claimed together
Predict which customers will buy new policies
Identify behaviour patterns of risky customers
Identify fraudulent behaviour
Page Number: 21
Potential Applications
• Transportation
Determine the distribution schedules among outlets
Analyse loading patterns
• Medicine
Characterise patient behaviour to predict office visits
Identify successful medical therapies for different illnesses
Predict the effectiveness of surgical procedures or medical tests
Page Number: 22
Potential Applications
• Sport
Make the best choice of players in different circumstances
Predict the results of relevant matches
Produce a better list of seeded players in groups or tournaments
DM report from an NBA game
When Price was Point-Guard, J.Williams missed 0% (0) of his jump field-goal attempts and made 100% (4) of his jump field-goal-attempts.
The total number of such field-goal-attempts was 4.
Page Number: 23
DM and Customer Relationship Management
CRM is a process that manages the interactions between a company and its customers
Users of CRM software applications are database marketers
Goals of database marketers are:
– identify market segments, which requires significant data about prospective customers and their buying behaviors
– build and execute campaigns
Tightly integrating the two disciplines presents an opportunity for companies to gain competitive advantage
Page Number: 24
DM and Customer Relationship Management
How Data Mining helps Database Marketing
– Scoring
– The role of Campaign Management Software
– Increasing the customer lifetime value
– Combining Data Mining and Campaign Management
Page Number: 25
DM and Customer Relationship Management
Evaluating the benefits of a Data Mining model
Gains chart
Profitability chart
Page Number: 26
Data Mining Examples
Bass Brewers
"We've been brewing beer since 1777. With increased competition comes a demand to make faster, better informed decisions."
Northern Bank
"The information is now more accessible, paperless and timely."
TSB Group Plc
"We are using Holos because of its flexibility and its excellent multidimensional database."
Page Number: 27
Data Mining Examples
Delphic Universities
"Real value is added to data by multidimensional manipulation (being able to easily compare many different views of the available information in one report) and by modeling."
Harvard - Holden
"Sybase technology has allowed us to develop an information system that will preserve this legacy into the twenty-first century."
J.P.Morgan
"The promise of data mining tools like Information Harvester is that they are able to quickly wade through massive amounts of data to identify relationships or trending information that would not have been available without the tool."
Page Number: 28
Case study of Breast Cancer Survival Analysis
Case study of the influence of various patient characteristics on survival rates for breast cancer
The survival analysis technique employed is Cox Regression (this technique is useful in situations where some of the patients do not die during the observation period)
A linear regression technique could be used if all patients had died during the observation period
Page Number: 29
Case study of Breast Cancer Survival Analysis
The observation period runs for 133.8 months
The modeling sample contains 746 patients (50 patients who died during the observation period and 696 who survived beyond the end of the observation period)
In this example, we are testing only four predictors:
– Age, in years, at the start of the observation period (22 to 88)
– Pathological tumor size, in centimeters (0.10 to 7.00)
– Number of positive axillary lymph nodes (0 to 35)
– Estrogen receptor status (positive vs. negative)
Page Number: 30
Case study of Breast Cancer Survival Analysis
The Cox Regression used a backward stepwise likelihood-ratio variable selection method
Significance criteria were set at 0.05 for inclusion in the model, and 0.10 for removal from the model
Printout from the final step of the stepwise regression analysis:
________________ Variables in the Equation ______________
Variable     B       S.E.    Wald      df   Sig     R       Exp(B)
AGE         -.0314   .0121    6.7486   1    .0094  -.0893    .9691
PATHSIZE     .3975   .1175   11.4476   1    .0007   .1259   1.4881
LNPOS        .1372   .0361   14.4100   1    .0001   .1443   1.1471
_______________________________________________________
The column labeled "Sig" shows the statistical significance of included variables
The column labeled "R" shows the degree of unique correlation with the dependent variable
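Two columns of the printout can be checked by hand: Exp(B) is the exponential of B (the hazard ratio per unit increase of the predictor), and for a single coefficient the Wald statistic is (B / S.E.) squared. The sketch below recomputes both from the B and S.E. values shown; the helper names are illustrative. Small differences in the Wald column are expected because the printed B and S.E. are rounded to four decimals.

```python
import math

# (variable, B, S.E.) copied from the printout above.
coefficients = [
    ("AGE", -0.0314, 0.0121),
    ("PATHSIZE", 0.3975, 0.1175),
    ("LNPOS", 0.1372, 0.0361),
]


def hazard_ratio(b):
    """Exp(B): multiplicative change in the hazard per one-unit increase of the predictor."""
    return math.exp(b)


def wald(b, se):
    """Wald chi-square statistic for a single coefficient (1 degree of freedom)."""
    return (b / se) ** 2


for name, b, se in coefficients:
    print(f"{name:9s} Exp(B)={hazard_ratio(b):.4f}  Wald={wald(b, se):.2f}")
```

Read this way, each additional positive lymph node multiplies the estimated hazard by about 1.15, while each additional year of age multiplies it by about 0.97.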
Page Number: 31
Case study of Breast Cancer Survival Analysis
Some key things to note are:
Estrogen status was removed as a predictor because it did not reach the 0.05 significance criterion for inclusion
Number of positive axillary lymph nodes was the strongest predictor of survival rates over the course of the observation period (R=.1443 / Sig.=.0001), followed by pathological tumor size (R=.1259 / Sig.=.0007)
Age, although significant, is somewhat less influential than the other two predictors (R=-.0893 / Sig.=.0094)
Note that both the number of positive axillary lymph nodes and the pathological tumor size are positively correlated with the dependent variable, which means that they are directly associated with more rapid mortality.
Age is negatively correlated with the dependent variable (Exp(B) is below 1), which means that older age is predictive of somewhat longer survival.
Page Number: 32
Case study of Breast Cancer Survival Analysis
The following chart shows the cumulative survival function during the observation period:
All patients survive through the 10th month of the observation period
At the fortieth month, the mortality rate increases and continues at this fairly constant increased rate through the forty-fifth month
At the forty-fifth month, there is a five-month period without additional mortality
11% of the original sample has died
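The chart in the case study comes from the Cox model. As a simpler stand-in, the sketch below uses the Kaplan-Meier estimator, a different, non-parametric technique, to show how a cumulative survival curve can be built when some patients are still alive at the end of observation (censored). The follow-up times and outcomes are invented for illustration.

```python
def kaplan_meier(times, died):
    """Return [(t, S(t))] at each distinct death time; died[i] is False for censored patients."""
    order = sorted(zip(times, died))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(order) and order[i][0] == t:
            if order[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without lowering the curve.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times in months and outcomes (True = died).
times = [12, 40, 42, 45, 60, 60]
died = [False, True, True, False, True, False]

for t, s in kaplan_meier(times, died):
    print(f"month {t}: S = {s:.2f}")
```

Censored patients still count in the risk set up to their last follow-up, which is why dropping them, or treating them as deaths, would bias the curve.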
Page Number: 33
Case study of Breast Cancer Survival Analysis
Conclusions and Implications
The case study presented here is relatively simple, and is for illustrative purposes only.
With the addition of more candidate predictors (progesterone receptor status, histologic grade, blood type etc.), an even more powerful model could emerge.
By understanding the influence of patient characteristics on mortality rates over time, we are in a better position to estimate survival times for individual patients, and to defend using different or more aggressive therapeutic approaches for some patients.
Page Number: 34
Securities Brokerage Case Study
Predictive market segmentation model designed to identify and profile high-value brokerage customer segments as targets for special marketing communications efforts.
The dependent variable for this ordinal CHAID model is brokerage account commission dollars during the past 12 months
We begin by splitting the client's entire customer file into a modeling sample and a validation sample. (Once the model is built using the modeling sample, we apply it to the validation sample to see how well it works on a sample other than the one on which it was built).
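The split into a modeling sample and a validation sample can be sketched as a seeded random partition. The 30% validation share, the function name and the stand-in customer list are assumptions; the case study does not say how its file was divided.

```python
import random


def split_file(records, validation_percent=30, seed=42):
    """Randomly partition records into (modeling, validation) samples."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = len(shuffled) * validation_percent // 100
    return shuffled[n_validation:], shuffled[:n_validation]


customers = list(range(1000))       # stand-in for the customer file
modeling, validation = split_file(customers)
print(len(modeling), len(validation))
```

Every record lands in exactly one of the two samples, so performance measured on the validation sample is not contaminated by records the model was built on.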
Page Number: 35
Securities Brokerage Case Study
The resulting CHAID model has 55 segments. The results are summarized in the following comb chart, showing the segment indexes (indexes of average dollar value)
Page Number: 36
Securities Brokerage Case Study
Part of the gains chart: Average Annual Brokerage Commission Dollars
The gains chart provides quantitative detail useful for financial and marketing planning.
We have highlighted the top 20% of the file in blue
The top 20% of the file is worth an average of about $334 per account, which is nearly three times the average account value for the entire sample.
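A gains chart of this kind can be computed from per-segment counts and averages. The sketch below ranks segments best-first and accumulates the share of the file, the cumulative average and an index where 100 equals the overall average. The three segments and their dollar figures are invented and are not the case study's data.

```python
def gains_chart(segments):
    """segments: list of (name, n_accounts, avg_dollars). Returns rows sorted best-first."""
    total_n = sum(n for _, n, _ in segments)
    total_dollars = sum(n * avg for _, n, avg in segments)
    overall_avg = total_dollars / total_n
    rows = []
    cum_n = 0
    cum_dollars = 0.0
    for name, n, avg in sorted(segments, key=lambda s: s[2], reverse=True):
        cum_n += n
        cum_dollars += n * avg
        cum_avg = cum_dollars / cum_n
        rows.append({
            "segment": name,
            "index": round(100 * avg / overall_avg),          # 100 = overall average
            "cum_pct_of_file": 100 * cum_n / total_n,
            "cum_avg": cum_avg,
            "cum_index": round(100 * cum_avg / overall_avg),
        })
    return rows


# Hypothetical segments: (name, number of accounts, average commission dollars).
segments = [("A", 200, 300.0), ("B", 300, 100.0), ("C", 500, 40.0)]

for row in gains_chart(segments):
    print(row)
```

Reading down the cumulative columns shows how much value is captured by contacting only the top share of the file, which is how a statement such as "the top 20% is worth nearly three times the average" is derived.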
Page Number: 37
Securities Brokerage Case Study
Using the data in the gains chart, we can better plan our communications/promotion budget.
In general, the best segments represent customers who are experienced, aggressive, self-directed traders.
Other decisions that the gains chart and the segmentation rules can help us make:
We might wish to conduct some market research among customers in under-performing segments, or among under-performing customers in the better segments
We can use the segment definitions to help us identify possible issues and question areas to include in the survey
Before we try to apply such a model, we perform a validation against a holdout sample, to confirm that it is a good model.
Page Number: 38
The End