slide 1
DSCI 4520/5240: Data Mining
Fall 2013 – Dr. Nick Evangelopoulos
Lecture 1: Introduction to Data Mining
Some slide material based on: Groth; Han and Kamber; Cerrito; SAS Education
slide 2
DSCI 4520/5240 DATA MINING
ITDS Résumé Book
ITDS majors (BCIS/DS), please send your résumé to [email protected] so that we can include it in the ITDS Résumé Book we send to our corporate partners for hiring/co-op consideration. Make sure the résumé is formatted per UNT standards. Sample résumés are available here: https://unt.optimalresume.com/
slide 3
Data (and the lack thereof)
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
(Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")
http://www.dilbert.com/2012-12-05/
slide 5
Nobel Laureate Calls Data Mining "A Must"
In an interview with ComputerWorld in January 1999, Dr. Penzias (winner of the 1978 Nobel Prize in Physics and former vice president and chief scientist at Bell Laboratories) described large-scale data mining of very large databases as the key application for corporations in the next few years.
In response to ComputerWorld's age-old question of "What will be the killer applications in the corporation?" Dr. Penzias replied:
"Data mining." He then added: "Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business," he said.
slide 6
What Is Data Mining?
Data mining (knowledge discovery in databases):
A process of identifying hidden patterns and relationships within data (Groth)
Data mining:
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
slide 7
Motivation: “Necessity is the Mother of Invention”
Data explosion problem
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
Problem: We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
slide 8
Data Deluge
– electronic point-of-sale data
– hospital patient registries
– catalog orders
– bank transactions
– remote sensing images
– tax returns
– airline reservations
– credit card charges
– stock trades
– OLTP
– telephone calls
slide 9
Data Mining, circa 1963
IBM 7090, 600 cases
“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
slide 10
Business Decision Support
Database Marketing
– Target marketing
– Customer relationship management
Credit Risk Management
– Credit scoring
Fraud Detection
Healthcare Informatics
– Clinical decision support
slide 12
Multidisciplinary
Databases
Statistics
Pattern Recognition
KDD
Machine Learning
AI
Neurocomputing
Data Mining
slide 13
What Is Data Mining?
IT: Complicated database queries
ML: Inductive learning from examples
Stat: What we were taught not to do
slide 16
Predictive Modeling
[Figure: data table labeled Cases, Inputs, and Target]
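As a minimal sketch of the layout this slide depicts, the Python below stores a few cases as rows and separates the input columns from the target column. The column names and values are hypothetical, invented only for illustration.

```python
# Hypothetical toy data: each row is one case (here, a customer).
cases = [
    {"age": 34, "income": 52000, "responded": 1},
    {"age": 51, "income": 87000, "responded": 0},
    {"age": 27, "income": 31000, "responded": 1},
]

TARGET = "responded"  # assumed name of the target column


def split_inputs_target(rows, target):
    """Return (inputs, targets): one input dict and one target value per case."""
    inputs = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return inputs, targets


X, y = split_inputs_target(cases, TARGET)
print(X[0])  # inputs of the first case
print(y)     # target values, one per case
```

A predictive model is then fitted to learn the mapping from the inputs to the target.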
slide 17
Types of Targets
Supervised Classification
– Event/no event (binary target)
– Class label (multiclass problem)
Regression
– Continuous outcome
Survival Analysis
– Time-to-event (possibly censored)
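To make the target types concrete, here is a rough sketch that guesses the kind of target from its values. The function name `target_type`, the example values, and the convention of encoding a survival target as `(time, event_observed)` pairs are all assumptions made for illustration; real tools usually ask the analyst to declare the target type.

```python
def target_type(values):
    """Guess the kind of target from its values (a rough heuristic, not a rule)."""
    if all(isinstance(v, tuple) and len(v) == 2 for v in values):
        # Assumed encoding: (time, event_observed); False means censored.
        return "survival"
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    if all(isinstance(v, str) for v in values):
        return "multiclass"
    return "continuous"


print(target_type([0, 1, 1, 0]))                            # event / no event
print(target_type(["gold", "silver", "bronze", "gold"]))    # class label
print(target_type([12.5, 80.0, 43.2]))                      # continuous outcome
print(target_type([(5.0, True), (12.0, False), (7.5, True)]))  # time-to-event
```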
slide 18
Why Data Mining? — Potential Applications
Database analysis and decision support
Market analysis and management
– target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
Risk analysis and management
– Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
– Text mining (news group, email, documents) and Web analysis
– Intelligent query answering
slide 19
Market Analysis and Management (1)
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
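One common way to quantify associations between product sales is with support, confidence, and lift. The sketch below computes them for a single rule over a handful of made-up baskets; the basket contents and the function names are hypothetical, and this is only an illustration of the measures, not a full association-rule miner.

```python
# Hypothetical market baskets: each set is one customer transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_both = support(set(antecedent) | set(consequent), transactions)
    confidence = s_both / s_a   # estimate of P(consequent | antecedent)
    lift = confidence / s_c     # values above 1 suggest a positive association
    return s_both, confidence, lift


s, c, l = rule_stats({"bread"}, {"milk"}, baskets)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

A lift below 1, as in this toy data, suggests that the two products are bought together slightly less often than independence would predict.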
slide 20
Market Analysis and Management (2)
Customer profiling
– data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
– identifying the best products for different customers
– use prediction to find what factors will attract new customers
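As an illustration of the clustering idea mentioned under customer profiling, the following is a bare-bones k-means sketch in plain Python. The slides do not name a specific clustering algorithm, so k-means is an assumption here, as are the customer values and the choice to start from the first k points.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means: returns (centroids, labels). Starts from the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical customers: (annual income in $1000s, monthly spending in $100s)
customers = [(30, 2), (85, 9), (32, 3), (90, 8), (28, 2), (88, 10)]
centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

In practice the inputs would usually be rescaled first, since a variable with a larger numeric range (income here) otherwise dominates the distance.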
slide 21
Corporate Analysis and Risk Management
Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
– summarize and compare the resources and spending
Competition:
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market
slide 22
Other Applications
Sports
IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
slide 23
In the News: Rexer Analytics Annual Data Mining Survey
The 2013 survey will become available in Fall 2013 (stay tuned)
slide 24
Rexer Analytics 2011 Survey Overview
• SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in 2011. Participants: 1,319 data miners from over 60 countries.
• FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years. “Improving the understanding of customers”, “retaining customers” and other CRM goals continue to be the primary goals.
• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the top three algorithms for most data miners. A third of data miners currently use text mining and another third plan to do so in the future.
• TOOLS: R continued its rise this year and is now being used by close to half of all data miners (47%). R users prefer it for being free, open source, and having a wide variety of algorithms. STATISTICA is selected as the primary data mining tool (17%). STATISTICA, KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings.
• ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of corporate respondents rate their company as having very high analytic sophistication. Measures of analytic success: Return on Investment (ROI), and predictive validity or accuracy of their models. Challenges to measuring success: user cooperation and data availability/quality.
slide 25
Where Data Miners Work
Data Mining is everywhere!
Data miners also report working in Non-profit (6%), Hospitality / Entertainment / Sports (3%), Military / Security (3%), and Other (9%).
© 2012 Rexer Analytics
slide 27
The positive impact of Data Mining
In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data miners from over 60 countries) data miners shared examples of situations where data mining is having a positive impact on society. The five areas mentioned most often were:
– Health / Medical Progress
– Business Improvements
– Personalized Communications & Marketing
– Fraud Detection
– Environmental
slide 28
The rise of Text Mining
[Pie chart segments: Text Miners; Plan to Start Text Mining; No Plans to Conduct Text Mining. Values as extracted: 34%, 33%, 33%]
Text Material
– Customer / market surveys: 38%
– Blogs and other social media: 33%
– E-mail or other correspondence: 27%
– News articles: 25%
– Scientific or technical literature: 23%
– Web-site feedback: 22%
– Online forums or review sites: 21%
– Contact center notes or transcripts: 16%
– Employee surveys: 15%
– Insurance claims or underwriting notes: 15%
– Medical records: 11%
– Point of service notes or transcripts: 10%
slide 29
Data Mining Software
• The average data miner reports using 4 software tools.
• R is used by the most data miners (47%).
[Chart: software usage by group: Overall, Corporate, Consultants, Academics, NGO / Gov’t]
slide 30
Satisfaction with Data Mining Tools
[Chart: satisfaction ratings, scale from Extremely Dissatisfied to Extremely Satisfied]
slide 31
Measuring Analytic Success
[Bar chart; axis: Number of respondents (0 to 60)]
– Model Performance (Accuracy, F, ROC, AUC, Lift)
– Financial Performance (ROI, etc.)
– Performance in Control or Other Group
– Feedback from User / Client / Management
– Cross-Validation
Bar values as extracted: 53, 43, 35, 29, 14
Question: Please share your best practices concerning how you measure analytic project performance / success. (text box provided for response)
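Some of the model-performance measures named on this slide (accuracy and the F measure) can be computed directly from a binary classifier's predictions. The sketch below is a plain-Python illustration on made-up labels; the function names and data are hypothetical, and ROC, AUC, and lift are left out for brevity.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn


def accuracy_and_f1(actual, predicted):
    """Return (accuracy, F1); assumes at least one predicted and one actual positive."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical labels: 1 = event, 0 = no event.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(actual, predicted)
print(f"accuracy={acc:.2f} F1={f1:.2f}")
```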
slide 32
Overcoming Data Mining challenges
In the four annual data miner surveys, these key challenges have been identified by data miners more than any others:
– Dirty Data
– Explaining Data Mining to Others
– Unavailability of Data / Difficult Access to Data
slide 33
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure, process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
slide 34
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
Choosing data mining algorithms
– summarization, classification, regression, association, clustering.
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
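The steps above can be sketched as a tiny pipeline. This is only a toy illustration under stated assumptions: the records, field names, and the threshold are made up, and the "mining" step is a simple summarization (a mean) rather than a real algorithm.

```python
# Hypothetical raw records, including a missing and a dirty value.
raw_records = [
    {"customer": "a", "region": "north", "amount": "120"},
    {"customer": "b", "region": "north", "amount": ""},
    {"customer": "c", "region": "south", "amount": "80"},
    {"customer": "d", "region": "north", "amount": "200"},
    {"customer": "e", "region": "south", "amount": "n/a"},
]


def clean(records):
    """Data cleaning: drop records whose amount cannot be parsed as a number."""
    cleaned = []
    for r in records:
        try:
            cleaned.append({**r, "amount": float(r["amount"])})
        except ValueError:
            continue
    return cleaned


def select(records, region):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]


def mine(records):
    """Data mining (here just a summarization): mean purchase amount."""
    return sum(r["amount"] for r in records) / len(records)


def evaluate(pattern, threshold):
    """Pattern evaluation: keep the pattern only if it clears the threshold."""
    return pattern if pattern >= threshold else None


cleaned = clean(raw_records)
task_relevant = select(cleaned, "north")
pattern = mine(task_relevant)
result = evaluate(pattern, threshold=100.0)
print(len(cleaned), pattern, result)
```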
slide 35
Data Mining and Business Intelligence
[Pyramid figure, listed top to bottom; the potential to support business decisions increases toward the top]
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts: OLAP, MDA
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA