business intelligence 3. data mining - kocwcontents.kocw.net/kocw/document/2014/koreasejong/... ·...
Data Mining
Definition
“Computational process of extracting interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data using methods from artificial intelligence, machine learning, statistics, and database systems.”
Alternative names
◦ Knowledge discovery in databases (KDD), knowledge
extraction, data/pattern analysis, business intelligence,
big data, etc.
Background
Explosive Growth of Data
◦ Transactions, web, e-commerce, mobile computing, sensors
(weather sensors, closed-circuit cameras), etc.
We are drowning in data, but starving for knowledge!
To mine knowledge from abundant data, we need
automated analysis
Data mining is one of the technologies for Big Data
A Typical Data Mining System
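[Figure: layered architecture of a typical data mining system. Data sources (database, data warehouse, World Wide Web, other info repositories) pass through data cleaning, integration, and selection into the database or data warehouse server, followed by the data mining engine, pattern evaluation, and visualization.]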
Data Mining is a Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]
Data Mining Classifications
Pattern Mining and Trend Analysis
◦ Association, regression
E.g., Diaper → Beer [support 2%, confidence 75%]
◦ Trend: common temporal behaviors
Classification (symbolic) and prediction (numeric)
◦ E.g., classify a given country based on its ‘climate’ (into tropical climate,
temperate climate, polar climate country)
◦ E.g., predicting temperatures and plants in a country.
Cluster analysis
◦ Group data into clusters that maximize intra-class similarity and inter-class dissimilarity
E.g., cluster animals into various classes (mammal, fish, etc.)
Outlier analysis
◦ Find data objects that do not comply with the general characteristics of the data
E.g., credit card fraud, call fraud, money laundering, etc.
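The two bracketed numbers in the Diaper/Beer example are read here as support and confidence, which is the usual convention for association rules but is an assumption about this slide. A minimal sketch of how both can be computed, using a hypothetical list of baskets and a hypothetical helper named `support_confidence`:

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all transactions containing both sides
    confidence = fraction of antecedent transactions that also contain
                 the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(
        1 for t in transactions if (antecedent | consequent) <= set(t)
    )
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy baskets, invented for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]
support, confidence = support_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, two of five baskets contain both items and three contain diapers, so the rule holds in 40% of all baskets and in two thirds of the diaper baskets.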
Business Applications of DM
Where does the business data come from?
◦ Credit card transactions, loyalty cards, discount coupons, customer complaint calls, (public) lifestyle studies, mobile devices, sensors, etc.
Where is DM applied?
◦ Customer Clustering
◦ Customer Profiling
◦ Determine customer purchasing patterns
◦ Cross-market analysis
◦ Predict what factors will attract new customers
◦ Fraud detection: auto insurance, medical insurance, phone-call frauds,
or money laundering detection
◦ Anti-terrorism
Classification and Prediction
Classification (symbolic)
◦ Construct a classification model from a training set (attribute values with class labels), then classify new instances into categories using the model
◦ Techniques: if/then rules, decision trees, artificial neural networks
Prediction (numeric)
◦ Predicts unknown or missing numeric values
◦ Construct models of continuous-valued functions, then use them for predictions
◦ Techniques: regression
Typical applications of classification and prediction
◦ Credit approval, Target marketing, Medical diagnosis, Fraud detection, etc.
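The two-phase idea (build a model from labeled training data, then classify new instances) can be sketched with a one-level decision tree, often called a decision stump. This is only a stand-in for the techniques listed above, not a full decision-tree learner; the function names and the climate figures are hypothetical.

```python
from collections import Counter


def train_stump(rows, labels):
    """Pick the single (feature, threshold) split with fewest training errors.

    Returns (feature_index, threshold, left_label, right_label), which reads
    as the rule: IF row[feature] <= threshold THEN left_label ELSE right_label.
    """
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[f] <= threshold]
            right = [lab for r, lab in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # not a real split
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def classify(model, row):
    """Apply the learned if/then rule to a new instance."""
    f, threshold, left_label, right_label = model
    return left_label if row[f] <= threshold else right_label


# Hypothetical training set: (mean temperature in C, annual rainfall in mm).
rows = [(-10.0, 300.0), (-5.0, 250.0), (26.0, 2000.0), (27.5, 1800.0)]
labels = ["polar", "polar", "tropical", "tropical"]

model = train_stump(rows, labels)
print(classify(model, (24.0, 1500.0)))
```

A real classifier would grow further splits and be evaluated on held-out data rather than on the training set.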
Prediction
Predict a numeric value for a target variable given the values of various condition (predictor) variables.
Two Step Process
1. Construct a model
2. Use the model to predict a value for a given input
Major method for prediction: regression
◦ Model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
◦ Linear regression
◦ Multiple linear regression
◦ Non-linear regression
◦ Etc.
Linear Regression
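Linear regression: involves a response variable y and a single predictor variable x
$$y = w_0 + w_1 x$$
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
where |D| is the number of training pairs (x_i, y_i), and $\bar{x}$, $\bar{y}$ are the means of the x and y values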
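As a minimal sketch of the method of least squares for a single predictor: the slope w1 is the sum of products of the x and y deviations from their means divided by the sum of squared x deviations, and the intercept is w0 = mean(y) - w1 * mean(x). The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the line y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Numerator and denominator of the slope estimate.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Hypothetical points lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1)

# Step 2 of the two-step process: use the model on a new input.
predicted = w0 + w1 * 5.0
```

Because the sample points are noise-free, the fit recovers the generating line; with real data the estimates minimize the sum of squared residuals instead.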
Multiple Linear Regression
Multiple linear regression: involves more than one predictor variable
◦ Ex. For 2-D data, training data is in the form of (x1, x2, y)
◦ Solvable by an extension of the least squares method for y = w0 + w1 x1 + w2 x2
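One common way to extend least squares to several predictors is to solve the normal equations (X^T X) w = X^T y, where each row of X is (1, x1, x2). The sketch below assumes that approach and uses plain Gaussian elimination; the slides do not say which solver is intended, and the names `solve` and `fit_multiple` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_multiple(rows, ys):
    """Least-squares fit of y = w0 + w1*x1 + ... + wk*xk."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical 2-D training data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]
w0, w1, w2 = fit_multiple(rows, ys)
print(w0, w1, w2)
```

For larger or ill-conditioned problems, a numerical library's least-squares routine is generally a safer choice than hand-written elimination.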
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a multiple linear regression model.
For example, y = w0 + w1 x + w2 x^2 + w3 x^3
can be converted to linear form with new variables x2 = x^2, x3 = x^3:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as power functions, can also be transformed into a linear model
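As a worked illustration of the power-function case, assuming a model of the form $y = a x^b$ with $a > 0$, $x > 0$, and $y > 0$ (the symbols $a$, $b$, $x'$, $y'$ are introduced here and do not appear in the slides), taking logarithms of both sides gives a linear model:

```latex
\begin{aligned}
y &= a\,x^{b} \\
\ln y &= \ln a + b \ln x \\
y' &= w_0 + w_1 x', \qquad
  y' = \ln y,\; x' = \ln x,\; w_0 = \ln a,\; w_1 = b
\end{aligned}
```

The coefficients $w_0$ and $w_1$ can then be estimated by ordinary linear regression on the transformed pairs $(\ln x_i, \ln y_i)$, and the original parameters recovered as $a = e^{w_0}$, $b = w_1$.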
Accuracy Measures
                     Predicted C       Predicted ~C      Total (fact)
Fact C               True positive     False negative    Positive
Fact ~C              False positive    True negative     Negative
Total (prediction)   Pos-predictions   Neg-predictions   Total
Accuracy measures
◦ accuracy = (t-pos + t-neg) / (pos + neg)
◦ error-rate = 1 - accuracy
◦ sensitivity = t-pos / pos
(e.g., identifying cows with Mad Cow Disease: find all positive cases)
◦ specificity = t-neg / neg
(e.g., identifying professional patients: find all negative cases)
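The four measures follow directly from the confusion-matrix counts, where pos = t-pos + f-neg is the number of actual positives and neg = f-pos + t-neg is the number of actual negatives. A minimal sketch, with a hypothetical function name and invented counts:

```python
def accuracy_measures(t_pos, f_neg, f_pos, t_neg):
    """Compute accuracy, error rate, sensitivity, and specificity
    from the four cells of a two-class confusion matrix."""
    pos = t_pos + f_neg  # actual positives (fact C)
    neg = f_pos + t_neg  # actual negatives (fact ~C)
    if pos == 0 or neg == 0:
        raise ValueError("need at least one actual positive and negative")
    accuracy = (t_pos + t_neg) / (pos + neg)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "sensitivity": t_pos / pos,
        "specificity": t_neg / neg,
    }


# Hypothetical counts: 100 actual positives, 900 actual negatives.
measures = accuracy_measures(t_pos=90, f_neg=10, f_pos=30, t_neg=870)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With unbalanced classes like these, accuracy alone can look high even when many positives are missed, which is why sensitivity and specificity are reported separately.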