Data Mining Dr. Chang Liu

Post on 11-Jan-2016



What is Data Mining

Data mining has been known by many different terms:
• Knowledge Discovery in Databases (KDD)
• Predictive Analytics
• Machine Learning
• Business Analytics

It is the process of finding hidden patterns in data
• For example, what is the profile of people who buy from us?

Usage of data mining has become widespread recently for various reasons

Businesses often see sizable increases in profitability as a result of applying data mining

Some Common Problems

Growing business by cross-selling
• A retailer can use buying patterns of customers to generate recommendations for new customers

Determine the risk of giving a loan to a particular customer
• Profiles of customers who have defaulted in the past are learned and used with new customers

Forecast the likely unemployment level based on its past trend

Is a credit card transaction likely to be fraudulent?

Is this tumor in a patient’s breast likely malignant?

Data Mining Tasks

Data mining problems are solved by performing a specific task:
• Given a problem, an analyst should first determine the data mining task that should be performed
• Example: I need to determine whether a customer is likely to default on a loan. I can solve this by performing a classification task

There are a number of tasks:
• Classification
• Association or market basket analysis
• Forecasting
• Deviation analysis
• Clustering or segmentation
• Sequence analysis
• Regression

Data Mining Tasks (cont.)

Classification is used to predict which of a few known outcomes a case is likely to be
• Is this customer likely to default? This has two known outcomes: “Yes” or “No”

Association is used to analyze transaction tables and determine which items in the transaction table tend to go together. Example?
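One possible answer to the association "Example?" prompt is sketched below. It counts how often two items appear in the same basket and reports support and confidence for each pair. The basket contents and the `pair_rules` helper are invented for illustration; this is plain pair counting, not the Microsoft Association Rules algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def pair_rules(transactions):
    """Return {(a, b): (support, confidence)} for every rule a -> b."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), together in pair_counts.items():
        # support: share of all baskets that contain both items
        # confidence: share of baskets with the first item that also hold the second
        rules[(a, b)] = (together / n, together / item_counts[a])
        rules[(b, a)] = (together / n, together / item_counts[b])
    return rules

rules = pair_rules(baskets)
print(rules[("butter", "bread")])
```

In this toy data every basket with butter also contains bread, so the rule butter -> bread has the highest confidence.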

Forecasting is used to generate new data points in a time series. Example?
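As one possible answer to the forecasting "Example?" prompt, the sketch below extends a series by its average historical change per step (the simple "drift" method). The unemployment figures and the `drift_forecast` name are made up; real forecasting algorithms such as Microsoft Time Series are considerably more sophisticated.

```python
def drift_forecast(series, steps):
    """Extend a time series by its average historical change per step."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    avg_change = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + avg_change * k for k in range(1, steps + 1)]

# Hypothetical monthly unemployment rates (percent).
history = [5.0, 5.2, 5.1, 5.4, 5.6]
forecast = drift_forecast(history, 3)
print(forecast)
```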

Deviation analysis is used to determine anomalous data points or outliers
• Used by security experts to detect network intrusion attacks
• Used by insurance companies and credit card companies to detect fraud
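A minimal sketch of deviation analysis, assuming a z-score rule: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The transaction amounts, the `find_outliers` name, and the threshold of 2.0 are illustrative assumptions, not a recommendation for real fraud detection.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
print(find_outliers(amounts))
```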

Data Mining Tasks (cont.)

Clustering or segmentation is used to discover natural groupings in data
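To make "natural groupings" concrete, here is a small sketch that runs a basic k-means loop over one numeric attribute. The spend figures, the `kmeans_1d` name, and the deterministic choice of starting centers are assumptions for illustration; this is not the Microsoft Clustering algorithm.

```python
def kmeans_1d(values, k, iterations=20):
    """Group numbers into k clusters with a basic k-means loop."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical annual spend per customer.
spend = [120, 130, 125, 900, 950, 920, 5000, 5200]
centers, clusters = kmeans_1d(spend, k=3)
print(clusters)
```

With this data the customers fall into low, medium, and high spenders without any labels being supplied.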

Sequence analysis discovers sequence patterns in events
• E.g., purchase of a computer is followed by purchase of a printer, then webcam …
• Used by marketing folks to understand and exploit buying habits
• Used to analyze web clickstream data
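A simple way to see sequence patterns is to count which item most often directly follows another. The sketch below does that for a few invented purchase histories; the data and the `next_purchase_counts` name are hypothetical, and this is much simpler than Microsoft Sequence Clustering.

```python
from collections import Counter

# Hypothetical purchase histories, one ordered list per customer.
histories = [
    ["computer", "printer", "webcam"],
    ["computer", "printer"],
    ["computer", "webcam"],
    ["phone", "case"],
]

def next_purchase_counts(sequences):
    """Count how often each item is immediately followed by another."""
    counts = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[(current, following)] += 1
    return counts

counts = next_purchase_counts(histories)
print(counts.most_common(1))
```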

Regression is used to predict numerical values
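As a small illustration of predicting a numerical value, the sketch below fits a straight line by ordinary least squares and uses it for a new input. The house sizes and prices are invented, and `fit_line` is a hypothetical helper rather than the Microsoft Linear Regression algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: house size (square meters) vs. price (thousands).
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]
slope, intercept = fit_line(sizes, prices)
predicted = slope * 100 + intercept  # predicted price for a 100 square meter house
print(predicted)
```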

Data Mining Algorithms

Microsoft SSAS provides the following data mining algorithms:

• Microsoft Decision Trees
• Microsoft Neural Network
• Microsoft Naïve Bayes
• Microsoft Association Rules
• Microsoft Time Series
• Microsoft Clustering
• Microsoft Sequence Clustering
• Microsoft Linear Regression
• Microsoft Logistic Regression

Case

The thing you are mining or asking questions about is called a case

• The case is often a row in a table; e.g., when studying which customers are likely to default on a loan, each row in the customer table is a case

• Transaction tables are an example of nested cases
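The idea of a nested case can be sketched as follows: a flat transaction table has one row per purchased item, and grouping those rows by customer turns each customer into a single case that carries a nested list of items. The column names and the `to_nested_cases` helper are assumptions made for illustration.

```python
# Hypothetical flat transaction table: one row per purchased item.
rows = [
    {"customer_id": 1, "item": "bread"},
    {"customer_id": 1, "item": "milk"},
    {"customer_id": 2, "item": "eggs"},
]

def to_nested_cases(transaction_rows):
    """Group item rows under their customer so each customer is one case."""
    cases = {}
    for row in transaction_rows:
        cases.setdefault(row["customer_id"], []).append(row["item"])
    return cases

cases = to_nested_cases(rows)
print(cases)
```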

Attributes / Case Key

Attributes are the variables that are used in the data mining analysis.
• Attributes are often columns in the case table

An attribute can be an input or an output
• At modeling time, both input and output attributes are provided
• At prediction time, input attributes are used to predict output attributes

Case Key indicates the identity of the case
• This is often the primary key or a row index
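The roles of case key, input attribute, and output attribute can be sketched with a toy classifier. At modeling time each case carries both the input and the output; at prediction time only the input is known. The loan data, the column names, and the `train` and `predict` helpers are hypothetical, and the majority-vote rule stands in for a real classification algorithm.

```python
from collections import Counter, defaultdict

# Hypothetical loan cases. "id" is the case key, "income" is an input
# attribute, and "defaulted" is the output attribute.
training_cases = [
    {"id": 1, "income": "low", "defaulted": "Yes"},
    {"id": 2, "income": "low", "defaulted": "Yes"},
    {"id": 3, "income": "low", "defaulted": "No"},
    {"id": 4, "income": "high", "defaulted": "No"},
    {"id": 5, "income": "high", "defaulted": "No"},
]

def train(cases, input_attr, output_attr):
    """Learn the most common output value for each input value."""
    votes = defaultdict(Counter)
    for case in cases:
        votes[case[input_attr]][case[output_attr]] += 1
    return {value: counter.most_common(1)[0][0] for value, counter in votes.items()}

def predict(model, case, input_attr):
    """Use only the input attribute; the output is unknown at prediction time."""
    return model.get(case[input_attr])

model = train(training_cases, "income", "defaulted")
new_case = {"id": 6, "income": "low"}  # no "defaulted" value yet
prediction = predict(model, new_case, "income")
print(prediction)
```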

Mining Structure / Mining Model

A mining structure is a table that contains the columns to be analyzed. It also contains data mining models used to analyze the data.

A mining model defines how the problem is to be modeled.
• Specifies which columns are included in the model
• Specifies the algorithm to be used
• Defines which columns are input and which are output
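The relationship between a mining structure and its mining models can be pictured with two small data classes. This is only a conceptual sketch: the class names, fields, and column names below are invented and do not correspond to the actual SSAS object model or API.

```python
from dataclasses import dataclass, field

@dataclass
class MiningModel:
    name: str
    algorithm: str
    input_columns: list
    output_columns: list

@dataclass
class MiningStructure:
    columns: list
    models: list = field(default_factory=list)

    def add_model(self, model):
        """Attach a model, checking that it only uses the structure's columns."""
        used = set(model.input_columns) | set(model.output_columns)
        missing = used - set(self.columns)
        if missing:
            raise ValueError(f"columns not in structure: {sorted(missing)}")
        self.models.append(model)

# Hypothetical structure with one model that predicts loan default.
structure = MiningStructure(columns=["CustomerId", "Income", "Age", "Defaulted"])
structure.add_model(
    MiningModel("LoanRisk", "Decision Trees", ["Income", "Age"], ["Defaulted"])
)
print(len(structure.models))
```

One structure can hold several models, each choosing its own algorithm and its own subset of the structure's columns.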

Training Models

Many data mining algorithms require historical data to learn patterns from

Training the model is also known as processing the model

Typically, not all available historical data is used to train the model
• A percentage is left for testing purposes. This set is called the testing set
• The data that is used to train the model is called the training set
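The split into training and testing sets can be sketched as a shuffle followed by a hold-out. The `split_cases` name, the 25% test fraction, and the fixed seed are assumptions for illustration; the right percentage depends on the problem and the amount of data.

```python
import random

def split_cases(cases, test_fraction=0.25, seed=0):
    """Shuffle a copy of the cases, then hold out a fraction for testing."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

historical = list(range(1, 21))  # stand-in for twenty historical cases
training_set, testing_set = split_cases(historical)
print(len(training_set), len(testing_set))
```

Shuffling before the split keeps the testing set from being, for example, only the most recent rows of the table.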

Class Activity_1

High school student historical data – CollegePlan table from DB661

You are asked to find out what factors influence a high school student to go to college (or not)

• What data mining task would you perform?
• What is the case in this case?
• What is the case key?
• What algorithm(s) is/are applicable for this task?
• Which attribute(s) is/are input?
• Which attribute(s) is/are output?

Class Activity_2

Explore vmMSFTYear2008 data in DB661

Predict Microsoft stock values in the first week of 2009 (The real data is available at vmMSFTFirstWeek2009)

Can you make money from MSFT based on your data mining knowledge?

QUESTIONS??

With a new student table

Results