A Brief Introduction to CRISP-DM
The Hard Facts About Data
• Enormous amounts of data are being stored in databases
• Businesses are increasingly becoming data-rich, yet, paradoxically, they remain knowledge-poor
“We are drowning in information, but starving for knowledge” - John Naisbitt
• Unless it is used to improve business practices, data is a liability, not an asset
• Standard data analysis techniques are useful but insufficient and may miss valuable insight
Real Examples
• Consider the enormous amounts of data generated
    Transactional data by credit card companies
    Searches on Google, Yahoo, and MSN
    Clickstream (web) or other sensor data
    Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
        • storage and analysis are a big problem
    Walmart reported to have 24 Tera-byte DB (likely even larger now)
    AT&T handles billions of calls per day
        • data cannot be stored -- analysis must be done on the fly
    Social media data
What Is Data Mining? Business Definition
• Deployment of business processes, supported by adequate analytical techniques, to:
    Take further advantage of data
    Discover RELEVANT knowledge
    ACT on the results
KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Application Domains (I)
• Direct marketing and retail
    Behavior analysis, Offer targeting, Market basket analysis, Up-selling, etc.
• Banks and financial institutions
    Credit risk assessment, Fraud detection, Portfolio management, Forecasting, etc.
• Telecommunications
    Churn prediction, Product/service development, Campaign management, Fraud detection, etc.
Application Domains (II)
• Healthcare
    Public health monitoring (infectious outbreaks, etc.), Outcomes measurement (performance, cost, success rate, etc.), Diagnostic help, etc.
• Pharmaceutical industry / Bio-informatics
    Biological activity prediction, Coding sequence discovery, Animal tests reduction, etc.
• Insurance
    Cross-selling, Risk analysis, Premium setting, Claims analysis, Fraud detection, etc.
Application Domains (III)
• Transportation
    Network management, Booking optimization, Customer service, etc.
• Manufacturing
    Load forecasting, Production management, Equipment monitoring, Quality management, etc.
• Etc.
Multidisciplinary
[Figure: Data Mining and Knowledge Discovery at the intersection of Databases, Statistics, Visualization, Machine Learning, and Business/Domain Knowledge]
Data Mining Tasks
• Summarization
• Classification / Prediction Classification, Concept learning, Regression
• Clustering
• Dependency modeling
• Anomaly detection
• Link Analysis
Summarization
• To find a compact description for a subset of the data.
    Producing the average down time of all plant equipment in a given month
    Computing the total income generated by each sales representative per region per year
• Techniques: Statistics, Information theory, OLAP, etc.
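As a rough illustration of the second example, the sketch below totals income per sales representative, region, and year. The records and the function name `total_income` are made up for the illustration; a real project would more likely use SQL or an OLAP tool.

```python
from collections import defaultdict

# Hypothetical sales records: (representative, region, year, income).
sales = [
    ("Ann", "East", 2004, 1200.0),
    ("Ann", "East", 2004, 800.0),
    ("Ann", "West", 2005, 500.0),
    ("Bob", "East", 2004, 300.0),
    ("Bob", "East", 2005, 700.0),
]

def total_income(records):
    """Sum income for each (representative, region, year) group."""
    totals = defaultdict(float)
    for rep, region, year, income in records:
        totals[(rep, region, year)] += income
    return dict(totals)

totals = total_income(sales)
for key in sorted(totals):
    print(key, totals[key])
```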
Prediction
• To learn a function that associates a data item with the value of a response variable. If the response variable is discrete, we talk of classification learning; if the response variable is continuous, we talk of regression learning.
    Assessing credit worthiness in a loan underwriting business
    Assessing the probability of response to a direct marketing campaign
• Techniques: Decision trees, Neural networks, Naïve Bayes, Support vector machines, Logistic regression, Nearest-neighbors, etc.
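To make the classification/regression distinction concrete, here is a minimal nearest-neighbors sketch that handles both cases. The applicant data, feature choice, and function names are assumptions made for the illustration, and the features are left unscaled for brevity (in practice they would usually be normalized first).

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training pairs (features, response) closest to query."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Discrete response: majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Continuous response: mean of the k nearest neighbors' values."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

# Hypothetical applicants: (income in k$, debt ratio) -> credit decision.
credit = [((20, 0.9), "reject"), ((25, 0.8), "reject"), ((30, 0.7), "reject"),
          ((60, 0.3), "accept"), ((70, 0.2), "accept"), ((80, 0.1), "accept")]
# Same features -> hypothetical probability of response to a campaign.
response = [((20, 0.9), 0.05), ((25, 0.8), 0.10), ((30, 0.7), 0.15),
            ((60, 0.3), 0.40), ((70, 0.2), 0.50), ((80, 0.1), 0.60)]

print(knn_classify(credit, (65, 0.25)))   # classification learning
print(knn_regress(response, (65, 0.25)))  # regression learning
```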
Clustering
• To identify a set of (meaningful) categories or clusters to describe the data. Clustering relies on some notion of similarity among data items and strives to maximize intra-cluster similarity whilst minimizing inter-cluster similarity.
    Segmenting a business’ customer base
    Building a taxonomy of animals in a zoological application
• Techniques: K-Means, Hierarchical clustering, Kohonen SOM, etc.
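A bare-bones K-Means sketch for the customer segmentation example follows. The customer data and the starting centroids are invented for the illustration; real implementations typically pick starting centroids at random and stop once assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (8, 50), (9, 55), (10, 52)]
final_centroids, clusters = k_means(customers, [(1, 10), (10, 52)])
for centroid, members in zip(final_centroids, clusters):
    print(centroid, members)
```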
Dependency Modeling
• To find a model that describes significant dependencies, associations or affinities among variables.
    Analyzing market baskets in consumer goods retail
    Uncovering cause-effect relationships in medical treatments
• Techniques: Association rules, ILP, Graphical modeling, etc.
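For market basket analysis, association rules are usually scored by support and confidence. The sketch below computes both on a handful of invented baskets; the item names and helper functions are assumptions for the illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

print("support(diapers, beer):", support(baskets, {"diapers", "beer"}))
print("confidence(diapers -> beer):", confidence(baskets, {"diapers"}, {"beer"}))
```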
Anomaly Detection
• To discover the most significant changes in the data from previously measured or normative values.
    Detecting fraudulent credit card usage
    Detecting anomalous turbine behavior in nuclear plants
• Techniques: Novelty detectors, Probability density models, etc.
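One very simple way to compare new values against "previously measured or normative values" is a z-score test against a baseline. The readings and the threshold of 3 standard deviations below are assumptions chosen for the illustration; they stand in for the more elaborate density models listed above.

```python
import statistics

def find_anomalies(baseline, new_values, threshold=3.0):
    """Return new values lying more than `threshold` sample standard
    deviations away from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical turbine readings under normal operation, then new readings.
baseline = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 49.7, 50.0]
readings = [50.1, 49.9, 53.5, 50.2]

anomalies = find_anomalies(baseline, readings)
print(anomalies)
```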
Data Mining Process
• CRISP-DM: Cross-Industry Standard Process for Data Mining
• Consortium effort involving:
    NCR Systems Engineering Copenhagen
    DaimlerChrysler AG
    SPSS Inc.
    OHRA Verzekeringen en Bank Groep B.V
• History:
    Version 1.0 released in 1999
    See www.crisp-dm.org for further details
Summary: Phases & Tasks
Business Understanding
• Determine Business Objectives: Background; Business Objectives; Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
• Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes; Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique; Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings; Models; Model Description
• Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions; Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report; Final Presentation
• Review Project: Experience Documentation
CRISP-DM Phases
• Business Understanding
    Initial phase
    Focuses on:
        • Understanding the project objectives and requirements from a business perspective
        • Converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives
• Data Understanding
    Starts with an initial data collection
    Proceeds with activities aimed at:
        • Getting familiar with the data
        • Identifying data quality problems
        • Discovering first insights into the data
        • Detecting interesting subsets to form hypotheses for hidden information
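A first pass at identifying data quality problems often amounts to counting missing values per attribute. The sketch below does this for a few invented subscriber records; the field names and the `missing_rates` helper are hypothetical.

```python
def missing_rates(records, fields):
    """Fraction of records in which each field is absent, None, or empty."""
    n = len(records)
    return {
        field: sum(1 for r in records if r.get(field) in (None, "")) / n
        for field in fields
    }

# Hypothetical subscriber records with gaps.
subscribers = [
    {"id": 1, "birthdate": "1950-03-02", "city": "Lyon"},
    {"id": 2, "birthdate": None, "city": "Paris"},
    {"id": 3, "birthdate": "", "city": ""},
    {"id": 4, "city": "Nice"},
]

rates = missing_rates(subscribers, ["id", "birthdate", "city"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```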
CRISP-DM Phases
• Data Preparation
    Covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data
    Data preparation tasks are likely to be performed multiple times, and not in any prescribed order
    Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for modeling tools
• Modeling
    Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values
    Typically, there are several techniques for the same data mining problem type
    Some techniques have specific requirements on the form of data; therefore, stepping back to the data preparation phase is often needed
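Parameter calibration is commonly done by trying candidate values and keeping the one that scores best on held-out data. The sketch below calibrates a single decision threshold this way; the validation pairs, candidate values, and function names are assumptions for the illustration, and CRISP-DM itself does not prescribe any particular search method.

```python
def accuracy(threshold, data):
    """Share of (score, label) pairs for which `score >= threshold`
    agrees with the boolean label."""
    return sum((score >= threshold) == label for score, label in data) / len(data)

def calibrate(candidates, validation):
    """Return the candidate threshold with the highest validation accuracy
    (the first such candidate if several tie)."""
    return max(candidates, key=lambda t: accuracy(t, validation))

# Hypothetical held-out data: (model score, true label).
validation = [(0.1, False), (0.3, False), (0.4, True),
              (0.6, False), (0.7, True), (0.9, True)]

best = calibrate([0.2, 0.35, 0.5, 0.65, 0.8], validation)
print(best, accuracy(best, validation))
```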
CRISP-DM Phases
• Evaluation
    At this stage, a model (or models) that appears to have high quality, from a data analysis perspective, has been built
    Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives
    A key objective is to determine if there is some important business issue that has not been sufficiently considered
    At the end of this phase, a decision on the use of the data mining results should be reached
CRISP-DM Phases
• Deployment
    Creation of the model is generally not the end of the project
    Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it
    Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process
    In many cases it will be the customer, not the data analyst, who will carry out the deployment steps
    However, even if the analyst will not carry out the deployment effort, it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models
The Missing Link
Monitoring
Closing the Loop
Changes in data
Changes in environment
How do I know my model remains valid and applicable?
When should I update my model(s)?
How do I update my model(s)?
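One possible answer to the first two questions is to keep scoring the deployed model on recent cases whose true outcome is known, and to flag it for an update when accuracy falls noticeably below what was measured at evaluation time. The tolerance value and the `needs_update` helper below are assumptions for the sketch, not a rule from the slides.

```python
def needs_update(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return True when accuracy on recent cases has dropped more than
    `tolerance` below the accuracy measured at evaluation time.

    recent_outcomes: list of booleans, True where the prediction was correct.
    """
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance

# Hypothetical monitoring windows for a model evaluated at 90% accuracy.
stable_window = [True] * 9 + [False] * 1     # 90% correct
degraded_window = [True] * 8 + [False] * 2   # 80% correct

print(needs_update(0.90, stable_window))
print(needs_update(0.90, degraded_window))
```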
Data Mining Myths (I)
• Data Mining produces surprising results that will utterly transform your business
    Reality:
        • Early results = scientific confirmation of human intuition.
        • Beyond = steady improvement to an already successful organisation.
        • Occasionally = discovery of one of those rare « breakthrough » facts.
• Data Mining techniques are so sophisticated that they can substitute for domain knowledge or for experience in analysis and model building
    Reality:
        • Data Mining = joint venture.
        • Close cooperation between experts in modeling and using the associated techniques, and people who understand the business.
Data Mining Myths (II)
• Data Mining is useful only in certain areas, such as marketing, sales, and fraud detection
    Reality:
        • Data mining is useful wherever data can be collected.
        • All that is really needed is data and a willingness to « give it a try. » There is little to lose…
• Only massive databases are worth mining
    Reality:
        • A moderately-sized or small data set can also yield valuable information.
        • It is not only the quantity, but also the quality of the data that matters (characterising mutagenic compounds)
Data Mining Myths (III)
• The methods used in Data Mining are fundamentally different from the older quantitative model-building techniques
    Reality:
        • All methods now used in data mining are natural extensions and generalisations of analytical methods known for decades.
        • What is new in data mining is that we are now applying these techniques to more general business problems.
• Data Mining is an extremely complex process
    Reality:
        • The algorithms of data mining may be complex, but new tools and well-defined methodologies have made those algorithms easier to apply.
        • Much of the difficulty in applying data mining comes from the same data organisation issues that arise when using any modeling techniques.
Food for Thought
• “Data mining can't be ignored -- the data is there, the methods are numerous, and the advantages that knowledge discovery brings to a business are tremendous.”
• “People who can't see the value in data mining as a concept either don't have the data or don't have data with integrity.”
• “Data mining is quickly becoming a necessity, and those who do not do it will soon be left in the dust. Data mining is one of the few software activities with measurable return on investment associated with it.”
Data Mining Deliverables
• Provides additional insight about the data and the business
• Provides scientific confirmation of empirical/intuitive business observations
• Discovers new, subtle pieces of business knowledge
In that order!
Key Success Factors
• Have a clearly articulated business problem that needs to be solved and for which Data Mining is the adequate technology
• Ensure that the problem being pursued is supported by the right type of data of sufficient quality and in sufficient quantity
• Recognise that Data Mining is a process with many components and dependencies
• Plan to learn from the Data Mining process whatever the outcome
Conclusion
• Data Mining transforms data into actions
• Data Mining is hard work
    It is a process, not a single activity
    Most companies are clueless and DM is an afterthought
    Plan to learn through the process
    Think big, start small
• Data Mining is FUN!
More on Data Mining
• KDnuggets: News, software, jobs, courses, etc.
    www.KDnuggets.com
• ACM SIGKDD: Data mining association
    www.acm.org/sigkdd
The Situation
• Potential applications:
    Associations of products that sell together
    Segmentation of customers
• Short audit:
    Nice DWH, only 2 years old, not fully populated
    Limited data on purchases and subscriptions
Summarization / Aggregation
• Revenue distribution
    80% generated by 41.5% of subscribers
    60% generated by 18.3% of subscribers
    42.9% generated by top 5 products
• Simple customer classes
    Over 65 years old most profitable
    Under 16 years old least profitable
• Birthdate filled-in for only about 10% of subscribers!
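Revenue distribution figures of this kind can be obtained by sorting subscribers by revenue and accumulating until the target share is reached. The revenue values below are invented and do not reproduce the percentages on the slide; `share_of_customers_for` is a hypothetical helper.

```python
def share_of_customers_for(revenues, target_share):
    """Smallest fraction of customers, taken from highest revenue down,
    whose combined revenue reaches `target_share` of the total."""
    total = sum(revenues)
    running = 0.0
    for count, revenue in enumerate(sorted(revenues, reverse=True), start=1):
        running += revenue
        if running >= target_share * total:
            return count / len(revenues)
    return 1.0

# Hypothetical yearly revenue per subscriber.
revenues = [400, 250, 200, 60, 40, 20, 15, 10, 3, 2]

print("80% of revenue comes from", share_of_customers_for(revenues, 0.8), "of subscribers")
print("60% of revenue comes from", share_of_customers_for(revenues, 0.6), "of subscribers")
```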
Product Association
• About 21% of subscribers buy P4, P7 and P9
    P4 is most profitable product
    P7 is ranked 6th
    P9 is ranked 15th with only 2% of revenue
• Several possible actions
    Make a bundle offering of these products
    Cross-sell from P9 to P4
    Temptation to remove P9 should be resisted
Summary of Findings
• Data Mining found:
    A small percentage of the customers is responsible for a large share of the sales
    Several groups of « strongly-connected » articles
    A sizeable group of subscribers who buy a single article
• What was learned?
    First 2 findings: « we knew that! » (BUT: scientific confirmation of business observation)
    3rd finding: « we could target these customers with a special offer! »
    Lack of relevant data: the structure is in place but not being used systematically
Finding Affinities
Metrics generally depend on the nature of the attribute (e.g., nominal, real, string)
Total Affinities (thicker lines indicate stronger affinities; highly connected group)
CHARACTERISTICS: Name, Sex, Hometown, Occupation, Political Affiliation, Children
More Than 2 Affinities
Seems to be an important link
Occupational Affinity Network
Moisture Farmers
Jedi Knights
Stronger affinity between Luke and Obi-Wan because they were both Jedi Knights and Jedi Masters
Birthday Networks (Two or more affinities)
Twins!
Close relatives that share birthdays
Duplicate individual
Given Name Network (One or more affinities)
Interesting! Both husband and wife’s maternal grandfathers share the same first and middle names.
Interesting naming pattern through generations
More neat naming patterns…
Relatives sharing the same middle names
Record Linkage
• The process of identifying similar people
• Essential for exchanging and/or merging pedigrees
• MAL4:6 uses the individuals and their relatives as found in their pedigrees
Challenges
• Each relationship/attribute is treated equally
• Weights
    Version 0.1 used feature selection instead of continuous weights
    Weights would allow MAL4:6 to use all of the data in a pedigree to a degree (TBD by MAL4:6)
• Naturally Skewed Data
    #NonMatches >> #Matches
    Learners tend to over learn the majority class
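A generic way to soften this kind of skew is to undersample the majority class before training. The sketch below is only an illustration of that general technique, not a description of what MAL4:6 does; the pair data, the `ratio` parameter, and the `undersample` helper are all assumptions.

```python
import random

def undersample(pairs, ratio=1.0, seed=0):
    """Keep every match and a random subset of non-matches, so that the
    number of non-matches is at most `ratio` times the number of matches.

    pairs: list of (candidate_pair, is_match) tuples.
    """
    matches = [p for p in pairs if p[1]]
    non_matches = [p for p in pairs if not p[1]]
    keep = min(len(non_matches), int(ratio * len(matches)))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return matches + rng.sample(non_matches, keep)

# Hypothetical labeled pairs: 5 matches, 95 non-matches.
pairs = ([(("a", i), True) for i in range(5)]
         + [(("b", i), False) for i in range(95)])

balanced = undersample(pairs)
print(len(balanced), sum(1 for _, label in balanced if label))
```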
Similarity
• Attributes: $A = \{A_1, A_2, \ldots, A_n\}$, where $A_i$ is a piece of information (e.g., date of birth)
• For each $A_i$, $sim_{A_i}$ is the similarity metric associated with $A_i$
• Let $x = \langle A_1 : a_1^x, A_2 : a_2^x, \ldots, A_n : a_n^x \rangle$ denote an individual, where $a_j^x$ is the value of $A_j$ for $x$, e.g., <firstname: John, lastname: Smith,…>
• Let $R = \{R_0, R_1, \ldots, R_m\}$ be a set of functions that map an individual to one of its relatives
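The slide defines the ingredients (attributes, per-attribute similarity metrics, relation functions) but not how they are combined. The sketch below shows one plausible combination, a weighted average over attributes that is then averaged over relations; the metrics, weights, record layout, and averaging scheme are all assumptions made for the illustration and should not be read as the MAL4:6 method.

```python
def exact(a, b):
    """Similarity for nominal values: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0

def year_sim(a, b, scale=10.0):
    """Similarity for years: falls linearly to 0.0 at `scale` years apart."""
    return max(0.0, 1.0 - abs(a - b) / scale)

# One similarity metric per attribute (hypothetical attribute set).
SIM = {"firstname": exact, "lastname": exact, "birth_year": year_sim}

def individual_similarity(x, y, weights):
    """Weighted average of per-attribute similarities over the attributes
    present in both individuals."""
    shared = [a for a in SIM if a in x and a in y]
    total = sum(weights[a] for a in shared)
    if total == 0:
        return 0.0
    return sum(weights[a] * SIM[a](x[a], y[a]) for a in shared) / total

def pedigree_similarity(x, y, relations, weights):
    """Average individual similarity over each relation that is known for
    both x and y. Each relation maps an individual to a relative or None."""
    scores = []
    for relation in relations:
        rx, ry = relation(x), relation(y)
        if rx is not None and ry is not None:
            scores.append(individual_similarity(rx, ry, weights))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical records; the first relation is the identity (the person itself).
father = {"firstname": "William", "lastname": "Smith", "birth_year": 1820}
x = {"firstname": "John", "lastname": "Smith", "birth_year": 1850, "father": father}
y = {"firstname": "John", "lastname": "Smith", "birth_year": 1852, "father": dict(father)}
relations = [lambda p: p, lambda p: p.get("father")]
weights = {"firstname": 1.0, "lastname": 1.0, "birth_year": 2.0}

score = pedigree_similarity(x, y, relations, weights)
print(round(score, 3))
```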
Results
• Genealogical database from the LDS Church’s Family History Department (~5 million individuals)
• ~16,000 labeled data instances
Precision: 88.9% Recall: 93.8%
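For reference, precision and recall are computed from counts of true positives, false positives, and false negatives. The counts in the sketch below are invented purely to show the arithmetic; they are not the counts behind the figures reported above.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# Hypothetical counts: 90 correct matches, 10 wrong matches, 30 missed matches.
precision, recall = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"Precision: {precision:.1%}  Recall: {recall:.1%}")
```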
Search Term Analysis
• Prior to April 2005
    Search terms used prior to April contained very few unique keywords
    Most common keywords used were words in the actual domain name
• Significant surge in April 2005
    Diversification of the search terms, often corresponding to new products/offers
    Doubling of number of unique visitors
• What happened?
    Search Engine Optimization (SEO)!
Shipping Policy
• August 2005
    Change shipping policy
    Highly visible, lower, free+
• Impact on abandoned carts?
    Not significant
• Before-After Purchases
    Marked increase in number of purchases in all categories
    100% increase for high-end category (free shipping)
    Can’t infer causality BUT clear indication of some effect