2015/12/13data mining1 what is cluster analysis? (1/4) cluster: a collection of data objects (...

112/04/21 Data Mining 1

What is Cluster Analysis? (1/4)
• Cluster: a collection of data objects (birds of a feather flock together)
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
– Partitioning a diverse group into more homogeneous clusters or subgroups
• Clustering is unsupervised classification: no predefined classes
– Data are grouped according to their own self-similarity; the meaning of each cluster can only be learned through interpretation afterward.


What is Cluster Analysis? (2/4)
• Discover hidden phenomena or internal structure


What is Cluster Analysis? (3/4)
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
− Clustering might be the first step in a market segmentation effort
  • a one-size-fits-all rule for “what kind of promotion do customers respond to best” (x)
  • what kind of promotion works best for each cluster (with similar buying habits) (o)


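What is Cluster Analysis? (4/4)
• User segments and spending power on an online shopping site
– People with similar profiles usually show similar behavior patterns

Member  Age  Average monthly income (thousands)
1       20   20
2       21   26
3       22   25
4       41   30
5       43   32
6       52   40
7       55   38

[Figure: scatter plot of age (x-axis) versus average monthly income in thousands (y-axis), with three clusters labeled C1, C2, and C3]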
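The grouping of the seven members by age and average monthly income can be sketched in code. The slides do not say which algorithm produced C1 to C3, so the block below uses k-means (Lloyd's algorithm) as an illustrative stand-in. The choice of k = 3, squared Euclidean distance, and members 1, 4, and 6 as starting centroids are all assumptions, and the function name `kmeans` is hypothetical.

```python
# Minimal k-means sketch on the (age, monthly income) table.
# Assumptions not taken from the slides: k = 3, squared Euclidean distance,
# and members 1, 4 and 6 as the initial centroids.

def kmeans(points, centroids, max_iter=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])),
            )
            for pt in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


members = [(20, 20), (21, 26), (22, 25), (41, 30), (43, 32), (52, 40), (55, 38)]
initial = [members[0], members[3], members[5]]
labels, centroids = kmeans(members, list(initial))
print(labels)      # cluster index per member
print(centroids)   # mean (age, income) per cluster
```

Under these assumptions the sketch groups members 1 to 3, 4 to 5, and 6 to 7 together. A different k or different starting centroids could give a different partition, which is one reason cluster results need interpretation afterward.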


What Is Good Clustering? (1/2)
• A good clustering method will produce high-quality clusters with
– high intra-class similarity and low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
− Example: among a dozen or so clusters of credit-card spending behavior, one cluster contains a high proportion of bad-debt cases, while the other clusters show no distinguishing features


What Is Good Clustering? (2/2)


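Issues in Cluster Analysis
• Which information (features, attributes) should the clustering be based on?
• Deciding the number of clusters in advance is difficult
• Which cluster a data object belongs to should be a matter of degree (fuzzy) rather than a yes-or-no question (crisp)
• In unsupervised learning there is no such thing as a best model
• Visualization tools vs. clustering algorithms (expert experience)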


A scatter graph helps to understand and visualize clusters of customers (1/2)


• Each axis represents a purchase of an item associated with that pet
• The box at each intersection shows the number of customers who purchased the corresponding items

Four segments of customers

1. Only-dog-owners

2. Only-cat-owners

3. Only-fish-owners and cat-and-dog-owners

4. The rest can be lumped together as “others”

A scatter graph helps to understand and visualize clusters of customers (2/2)


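Cluster Analysis based on RFM (1/2)
• Analyzing RFM values quantifies customer purchasing behavior and measures customer loyalty and contribution, which supports customer segmentation and the targeting of customers
• R (Recency): the time period since the last purchase
• F (Frequency): the number of purchases made in a certain time period
• M (Monetary): the amount of money spent during a certain period of time

Cluster Analysis based on RFM (2/2)
• Obtain the customers' RFM values within a given time interval and run a cluster analysis on them
• The average RFM values of each cluster (Vc) are compared with the total average RFM values of all clusters (Vt)
– if Vc > Vt then give one mark, else give the other
• Target customers and marketing strategies
– R F M: Promising customers
– R F M: Loyal customers
– R F M: Vulnerable customers
• Some combinations of changes are hard to interpret, and the magnitude of the changes is not taken into account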
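The comparison of each cluster's average RFM values (Vc) with the overall average (Vt) can be sketched as below. The customer records, the cluster ids, the "above"/"below" labels, and the function name `rfm_profile` are all hypothetical. Taking Vt as the mean over all customers is also an assumption, since the slides do not spell out how the total average is formed.

```python
from statistics import mean

# Hypothetical customers: (cluster id, recency in days, frequency, monetary).
customers = [
    ("c1", 5, 12, 900.0),
    ("c1", 9, 10, 700.0),
    ("c2", 60, 2, 100.0),
    ("c2", 80, 1, 50.0),
]


def rfm_profile(rows):
    """Label each cluster's mean R, F, M (Vc) as above or below the overall mean (Vt)."""
    overall = [mean(r[i] for r in rows) for i in (1, 2, 3)]  # Vt for R, F, M
    profile = {}
    for cid in sorted({r[0] for r in rows}):
        cluster_rows = [r for r in rows if r[0] == cid]
        vc = [mean(r[i] for r in cluster_rows) for i in (1, 2, 3)]
        # Ties are counted as "below" in this sketch.
        profile[cid] = tuple("above" if c > t else "below" for c, t in zip(vc, overall))
    return profile


profile = rfm_profile(customers)
print(profile)
```

The labels only record direction, not size: a cluster slightly above the average and one far above it get the same mark, which is the limitation noted on the slide.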


Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Text mining: document classification, customer complaint handling, patient medical record analysis, and military and criminal intelligence management (based on the similarity of keyword structure)


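Data Classification vs. Data Clustering
• Data Classification
– Classifies data according to the attributes of the data and a set of pre-established rules
– Requires some prior understanding of the structure of the data
– Finds the relationships between many (input) variables and the target (output) variable
• Data Clustering
– Groups data into clusters without needing to understand the characteristics and structure of the data in the database
– Maximizes the similarity of data within a group and minimizes the similarity between groups
– Reveals the structure among variables, leaving more room for interpretation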


Description and Visualization (1/2)

• Describe what is actually going on in a complex database, so that we gain more knowledge and understanding of our customers, products, processes, and so on.
− A good enough description of a behavior will often suggest an explanation for it as well
  • e.g., parental movie viewing habits are strongly influenced by the taste of children


Description and Visualization (2/2)

• Data visualization is one powerful form of descriptive data mining.
− It is not always easy to come up with meaningful visualizations, but the right picture really can be worth a thousand association rules

− Data Cube, Scatter graph, Histogram, …


Data Mining Techniques
• Statistical Analysis
• Association Analysis
• Classification
• Clustering Analysis
• Other techniques
– Trend Analysis, Time Series Analysis, Regression Analysis, Outlier Analysis, or Neural Network techniques from the field of artificial intelligence, and so on.


An Example: the Moviegoers Database
All six tasks in one small database
• We wondered who goes to see a movie and what movies a person watches
• The moviegoers database contains the responses to an informal survey conducted during August and September of 1996
• The sample populations: the survey was distributed to four different populations in hopes that interesting intergroup differences might be revealed
• The survey asked for age, sex, and last movies seen in a movie theater


[Figure: The layout of the moviegoers database]


Moviegoer Survey (the first few rows are shown)

Name     Sex  Age  Source          Movie
Amy      F    27   Oberlin         Independence Day
Andrew   M    25   Oberlin         12 Monkeys
Andy     M    34   Oberlin         The Birdcage
Anne     F    30   Oberlin         Trainspotting
Ansje    F    25   Oberlin         I Shot Andy Warhol
Beth     F    30   Oberlin         Chain Reaction
Bob      M    51   Pinewoods       Schindler’s List
Brian    M    23   Oberlin         Super Cop
Candy    F    29   Oberlin         Eddie
Cara     F    25   Oberlin         Phenomenon
Cathy    F    39   124 Mt. Auburn  The Birdcage
Charles  M    25   Oberlin         Kingpin
Curt     M    30   MRJ             T2 Judgment Day
David    M    40   MRJ             Independence Day
Erica    F    23   124 Mt. Auburn  Trainspotting


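What can data mining do? (1/3)
• Moviegoer classification
– Determine sex from age, source, and movies seen
– Determine source from sex, age, and movies seen
– Determine which movie a person will see (most recent movie) from previously seen movies, age, sex, and source
– Technique: decision trees
• Moviegoer estimation
– Age is a continuous variable, so it can serve as the target variable of an estimation task.
– Age = f(source, sex, movies seen)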


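What can data mining do? (2/3)
• Moviegoer prediction
− Predict who the audience will be when a new movie is released
  • Cluster both the moviegoers and the movies
  • For each cluster of moviegoers, mine rules that explain that group's taste in movies
  • For each cluster of movies, mine rules that describe its best target audience
  • When a new movie is released, the target audience can be found from the cluster the new movie belongs to
• Moviegoer affinity grouping
− Which movies are always watched by the same kind of people (which movies go together?)
− Analyze sex categories through the generated association rules (virtual items)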


What can data mining do? (3/3)
• Moviegoer clustering
− to find groups of movies that go together because they are seen by the same people
− to find groups of people that go together because they see the same movies
  • e.g., people with young children form a clearly recognizable cluster in the moviegoers database
• Moviegoer description
− Basic statistics: average age, percentage of females.
− Association rules: people who have seen movie X also see movie Y
− Rules can also be regarded as descriptions: males aged 12 to 17 like to watch movie X


Evaluation and Interpretation
• Model validation
– after building a model, you must evaluate its results and interpret their significance
– accuracy by itself is not necessarily the right metric for selecting the best model. You need to know more about the type of errors and the costs associated with them
• Confusion matrices
– for classification problems, a confusion matrix is a very useful tool for understanding results
– it shows not only how well the model predicts, but also presents the details needed to see exactly where things may have gone wrong


Confusion matrix (1/2)

Model X (rows: prediction, columns: actual)

Prediction  Class A  Class B  Class C
Class A     45       2        3
Class B     10       38       2
Class C     4        6        40

– this is much more informative than simply telling us an overall accuracy rate of 82% (123/150)
– If there are different costs associated with different errors, a model with a lower overall accuracy may be preferable to one with higher accuracy but a greater cost to the organization due to the types of errors it makes

Confusion matrix (2/2)

Model Y (rows: prediction, columns: actual)

Prediction  Class A  Class B  Class C
Class A     40       12       10
Class B     6        38       1
Class C     2        1        40

– The accuracy has dropped to 79% (118/150)
– Suppose each correct answer had a value of $10 and each incorrect answer for class A had a cost of $5, for class B a cost of $10, and for class C a cost of $20
  • The net value of model X = (123*10)-(5*5)-(12*10)-(10*20) = 885
  • The net value of model Y = (118*10)-(22*5)-(7*10)-(3*20) = 940
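The accuracy and net-value arithmetic for the two models can be checked with a short sketch. The matrices and the dollar figures are the ones on the slides. The function names are hypothetical. Charging each error the cost of its predicted class (its row) is an assumption, but it is the reading under which the error counts 5/12/10 and 22/7/3 in the slide formulas match the matrices.

```python
# Rows are predicted classes A, B, C; columns are actual classes A, B, C.
model_x = [[45, 2, 3], [10, 38, 2], [4, 6, 40]]
model_y = [[40, 12, 10], [6, 38, 1], [2, 1, 40]]

CORRECT_VALUE = 10          # value of each correct answer, in dollars
ERROR_COST = [5, 10, 20]    # cost of an incorrect answer for class A, B, C


def accuracy(matrix):
    """Share of cases on the diagonal (prediction equals actual)."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def net_value(matrix):
    """Value of correct answers minus the cost of errors, per predicted class."""
    value = 0
    for i, row in enumerate(matrix):
        correct = row[i]
        wrong = sum(row) - correct  # errors among cases predicted as class i
        value += correct * CORRECT_VALUE - wrong * ERROR_COST[i]
    return value


for name, m in (("X", model_x), ("Y", model_y)):
    print(name, round(accuracy(m), 4), net_value(m))
```

Model Y has the lower accuracy but the higher net value, because most of its errors fall in the cheapest class.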


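Using the Confusion Matrix (1/4)
• Data mining: use historical data to find rare events
– Rare events bring high profit or severe loss, yet taking action on every customer is not worthwhile
• A confusion matrix provides three kinds of information: the 3Rs
– Response Rate: how many rare events are found within the list we predicted?
– Recall: what proportion of all rare events did the prediction capture?
– Range Reduce: how much is the list narrowed when the data mining model is used to look for rare events?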



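• Response Rate: the ability to prefer quality over quantity
– Response Rate = 6961 / (2497 + 6961) = 73.6%
– Overall Response Rate = (6961 + 2171) / (6855 + 2171 + 2497 + 6961) = 49.4%
– The model's response rate is 1.49 times the overall rate

Rows: prediction, columns: actual

Prediction  Class 0  Class 1
Class 0     6855     2171
Class 1     2497     6961

0: will not purchase, 1: will purchase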



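• Recall: better to flag ten thousand by mistake than to let a single one slip through
– Recall = 6961 / (6961 + 2171) = 76.23%
• Range Reduce: the cost of running a campaign based on the model
– Range Reduce = (6961 + 2497) / (6855 + 2171 + 2497 + 6961) = 51.2%

Rows: prediction, columns: actual

Prediction  Class 0  Class 1
Class 0     6855     2171
Class 1     2497     6961

0: will not purchase, 1: will purchase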
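The three measures can be computed directly from the four cells of the purchase matrix. The sketch below uses the counts from the slides. The variable names (`tp`, `fp`, `fn`, `tn`, `lift`) are conventional labels chosen here and are not the slides' terminology.

```python
# Cells of the confusion matrix (rows: prediction, columns: actual).
tn, fn = 6855, 2171   # predicted 0: actually 0, actually 1
fp, tp = 2497, 6961   # predicted 1: actually 0, actually 1
total = tn + fn + fp + tp

response_rate = tp / (tp + fp)    # buyers among those the model selects
overall_rate = (tp + fn) / total  # buyers among all customers
recall = tp / (tp + fn)           # share of all buyers that the model finds
range_reduce = (tp + fp) / total  # share of customers left on the list
lift = response_rate / overall_rate

print(f"response rate {response_rate:.1%}, overall {overall_rate:.1%}, lift {lift:.2f}")
print(f"recall {recall:.2%}, range reduce {range_reduce:.1%}")
```

Read together, the model keeps about half of the customers on the list while still reaching roughly three quarters of the buyers.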



• Which model is best depends on the business problem
– For a marketing response problem, we want to get as many potential responders as possible and we do not care about false positives
– For a medical diagnostic test for cancer, we might use such a model as an initial screen. We care a lot about false negatives, and we want as few as possible


The Lift (Gain) Chart
• It shows how responses are changed by applying the model. This change ratio is called the lift

The ROI (Return on Investment) Chart
• A pattern may be interesting, but acting on it may cost more than the revenue or savings it generates
• Here, ROI is defined as the ratio of profit to cost

The Profit Chart
• Profit = revenue minus cost
• The maximum lift was achieved at the 1st decile (10%), the maximum ROI at the 2nd decile (20%), and the maximum profit at the 3rd and 4th deciles
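How lift, ROI, and profit can peak at different deciles is easier to see with numbers. The sketch below computes cumulative versions of all three from made-up inputs. The responder counts, decile size, fixed cost, cost per contact, and revenue per response are all hypothetical, chosen only so that the peaks fall where the slide describes. They are not the data behind the original charts, and the function name `cumulative_charts` is illustrative.

```python
# Hypothetical inputs: responders found in each score decile (best first).
responders = [300, 200, 150, 100, 80, 60, 40, 30, 25, 15]
DECILE_SIZE = 1000            # customers per decile (assumed)
FIXED_COST = 1000.0           # campaign setup cost (assumed)
COST_PER_CONTACT = 1.0        # assumed
REVENUE_PER_RESPONSE = 10.0   # assumed


def cumulative_charts(responders):
    """Cumulative lift, ROI and profit when contacting the top k deciles."""
    overall_rate = sum(responders) / (DECILE_SIZE * len(responders))
    rows, found = [], 0
    for k, r in enumerate(responders, start=1):
        found += r
        contacted = DECILE_SIZE * k
        cost = FIXED_COST + contacted * COST_PER_CONTACT
        profit = found * REVENUE_PER_RESPONSE - cost  # profit = revenue - cost
        rows.append({
            "decile": k,
            "lift": (found / contacted) / overall_rate,
            "roi": profit / cost,                     # ROI = profit / cost
            "profit": profit,
        })
    return rows


rows = cumulative_charts(responders)
for row in rows:
    print(row["decile"], round(row["lift"], 2), round(row["roi"], 3), row["profit"])
```

With these assumed numbers, lift is highest when only the first decile is contacted, ROI peaks at the second, and profit is highest (and tied) at the third and fourth, so the best cutoff depends on which measure the business cares about.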


External Validation
• No matter how good the accuracy of a model is estimated to be, there is no guarantee that it reflects the real world

– One of the main reasons for this problem is that there are always assumptions implicit in the model

The inflation rate may not have been included as a variable in a model that predicts the propensity of an individual to buy

It is important to test a model in the real world

– do a test mailing to verify the model

– try the model on a small set of applicants before full deployment


Deploy the model and results (1/2)
• The first way is for an analyst to recommend actions based on simply viewing the model and its results
– The analyst may look at the clusters the model has identified, the rules that define the model, or the lift and ROI charts that depict the effect of the model
• The second way is to apply the model to different data sets
– to flag records based on their classification,
– to assign a score such as the probability of an action, or
– to select some records from the database and subject these to further analyses with an OLAP tool, and so on


Deploy the model and results (2/2)
• The amount of time to process each new transaction, and the rate at which new transactions arrive, will determine whether a parallelized algorithm is needed
– Monitoring credit card transactions or cellular telephone calls for fraud
• When delivering a complex application, data mining is often only a small, albeit critical, part of the final product
– In a fraud detection system, known patterns of fraud may be combined with discovered patterns
• You must measure how well your model has worked after you use it (model monitoring)
– The model may need to be retested, retrained, and possibly completely rebuilt


Acting on the Results (1/2)
• Sometimes, it is valuable to incorporate a bit of experimental design into the process

– If we are predicting customer response to a product, we might have three different groups

1) A group of customers based on the results of the Data Mining model, who get the marketing message

2) A group of customers chosen at random, who get the marketing message

3) A group of customers chosen at random, who do not get the marketing message


Acting on the Results (2/2)
– What we hope is that
  • the first group will have a high response rate
  • the second group will have a mediocre response rate
  • the third will have a negligible response rate
– We can test the strength of the marketing message
  • The difference in response between the second and third groups
– We can test the strength of the data mining
  • The difference between the first and second groups
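The two comparisons in the three-group design reduce to differences between response rates. The group sizes and response counts below are hypothetical, as are the names in the sketch. It only illustrates the arithmetic and leaves out any test of statistical significance.

```python
# Hypothetical campaign results: group -> (customers in group, responses).
groups = {
    "model_selected_with_message": (1000, 120),
    "random_with_message": (1000, 40),
    "random_without_message": (1000, 5),
}


def response_rate(name):
    """Responses divided by group size for the named group."""
    size, responses = groups[name]
    return responses / size


# Strength of the marketing message: second group versus third group.
message_effect = response_rate("random_with_message") - response_rate("random_without_message")
# Strength of the data mining: first group versus second group.
mining_effect = response_rate("model_selected_with_message") - response_rate("random_with_message")

print(f"message effect: {message_effect:.1%}, mining effect: {mining_effect:.1%}")
```

Keeping a random group that receives no message is what lets the two effects be separated. Without it, a high response in the first group could not be attributed to the model rather than to the message itself.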


Measuring the Model’s Effectiveness
• We need to compare the results to what actually happened in the real world
– Did the predicted behavior actually happen?
  • Did the prospects accept the offer, did the customers purchase the new product, did they churn?
– The lift charts and confusion matrices can be adapted to compare actual results to predicted results
– The score set is usually more recent than the model set
  • Model performance usually degrades over time
  • The model captures patterns from the past and, over time, the patterns become less relevant


What Makes Predictive Modeling Successful?

A. Modeling Shelf-Life

B. The whole process of predictive modeling is based on some key assumptions


A. Modeling Shelf-Life

Looking at time frames brings up two critical questions about models and their predictions:
① What is the shelf-life of a model?
• The things being modeled change over time
• A model created five years ago, or last year, or last month, may no longer be valid
• You need to train a new model on more recent data
② What is the shelf-life of a prediction?
• Predictions are valid during a particular time frame


B. Key Assumption 1 (1/2)
• The Past Is a Good Predictor of the Future
– How patients reacted to a drug in the past
– However, external factors will always have an influence on the model being built
  • Retail sales decrease during cold weather and blizzards
  • Mortgage lending increases when interest rates go down
  • Seasonal patterns: the Christmas season and back-to-school season drive many retail sales
  • Models developed during years of relatively stable financial markets were not applicable in the more volatile markets


B. Key Assumption 1 (2/2)

The Past Is a Good Predictor of the Future

– How do we know when the past is a good predictor of the future ? We can never know for sure

It is critical to

Include domain experts (have insight about important factors) in the modeling process

Include enough of the right data (seasonal factors) to make good decisions


B. Key Assumption 2
• The Data is Available
– Data may not be available for any number of different reasons

The data may not be collected by the operational systems

The database is too busy most of the time to prepare extracts

The data is owned by an outside vendor

And so on

– Ensuring that the right data is available is critical to building successful predictive models


B. Key Assumption 3
• The Data Contains What We Want to Predict
– To apply the lessons of the past to the future, we need to be comparing apples to apples and oranges to oranges
– Often, the business people phrase their needs very ambiguously
  • “We are interested in people who do not pay their bills”
– Sometimes business users have unreasonable expectations of their data
  • When building a response model, we must know who responded to the campaign and who received the campaign
  • For advertising campaigns, the second group is not known
  • However, we can compare the responders to a random sample of the general population


Selecting Data Mining Products (1/3)
• There are three main types of data mining products
1) Tools that are analysis aids for OLAP
  – Help OLAP users identify the most important dimensions and segments on which they should focus attention
  – e.g., Business Objects Business Miner, Cognos Scenario
2) The “pure” data mining products
  – Horizontal tools aimed at data mining analysts concerned with solving a broad range of problems
  – e.g., IBM Intelligent Miner, Oracle Darwin, SAS Enterprise Miner, SGI MineSet, and SPSS Clementine
3) Analytic applications which implement specific business processes for which data mining is an integral part
  – Customized packages with the data mining embedded


Selecting Data Mining Products (2/3)
• Basic capabilities
– Nothing substitutes for actual hands-on experience
– Depending on your particular circumstances (system architecture, staff resources, database size, problem complexity), some data mining products will be better suited than others to meet your needs
– System architecture: works on a stand-alone desktop machine or a client-server architecture
– Data preparation
– Data access: no single product can support the large variety of database servers
– Algorithms


Selecting Data Mining Products (3/3)
• Basic capabilities (continued)
– Interfaces to other products
  • Many tools can help you understand your data before you build your model, and help you interpret the results of your model
  • These include traditional query and reporting tools, graphics and visualization tools, and OLAP tools
– Model evaluation and interpretation
– Model deployment
  • When you need to apply the model to new cases as they come, it is usually necessary to incorporate the model into a program using an API or code generated by the data mining tool
– Scalability
– User interface
  • The people who build, deploy, and use the results of the models may be different groups with varying skills


The Virtuous Cycle of DM (1/2)
• Data mining can be applied to many problems in many industries
– Most common applications are in marketing, specifically for CRM
  • Applied to prospecting for new customers, retaining existing ones, and increasing customer value
  • Applied to understanding customer behavior and optimizing manufacturing processes
• Although they may have much in common, every application has its own unique characteristics
– Within a single industry, different companies have different strategic plans and different approaches


The Virtuous Cycle of DM (2/2)
• The virtuous cycle is a high-level process, consisting of four major business processes:
1. Identifying the business problem
2. Transforming data into actionable results
3. Acting on the results
4. Measuring the results
• There are no shortcuts: success in DM requires all four processes
– Expertise grows as organizations focus on the right business problems, learn about data and modeling techniques, and improve data mining processes based on the results of previous efforts


Data Description and Data Mining Model Building (1/2)
• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions
• The first and simplest analytical step in data mining is to describe the data
– Summarize its statistical attributes (such as means and standard deviations)
– Visually review it using charts and graphics (visualization)
– Look for potentially meaningful links among variables (such as values that often occur together)
– Clustering
• Collecting, exploring, and selecting the right data are critically important


Data Description and Data Mining Model Building (2/2)
• In general, data description alone cannot provide an action plan
– You must build a predictive model based on patterns determined from known results (model training), then test that model on results outside the original sample (model testing)
  • The accuracy (or error) rate is a good estimate of how the model will perform on future datasets that are similar to the training and test datasets
– Finally, you must empirically verify the model
  • e.g., send a mailing to a portion of the new list and see what results you get


Predictive Data Mining (1/2)
• A hierarchy of choices
– Business goal
  • What is the ultimate purpose of mining this data?
  • Retain good customers, identify customers likely to leave, or predict customer profitability
– Type of prediction
  • Classification or regression
– Model type
  • Neural networks or decision trees
  • Your choice of model type will influence what data preparation you must do and how you go about it
– Algorithm
– Product
  • Products generally have different implementations of a particular algorithm even when they identify it with the same name


Predictive Data Mining (2/2)

• No tool or technique is perfect for all data
– Many business goals are best met by building multiple model types using a variety of algorithms

– You may not be able to determine which model type is best until you’ve tried several approaches


Summary (1/2)
• Data mining offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behavior of customers, products and processes

However, data mining tools need to be guided by users who understand the business, the data, and the general nature of the analytical methods involved


Summary (2/2)
• Building models is only one step in knowledge discovery
– It is vital to properly collect and prepare the data, and to check the models against the real world
– The “best” model is often found after building models of several different types, or by trying different technologies or algorithms
• Choosing the right data mining products means finding a tool with good basic capabilities
– an interface that matches the skill level of the people who’ll be using it, and features relevant to your specific business problems
– After you’ve narrowed down the list of potential solutions, get a hands-on trial of the likeliest ones


Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views, different classifications

– Kinds of databases to be mined

– Kinds of knowledge to be discovered

– Kinds of techniques utilized

– Kinds of applications adapted


A Multi-Dimensional View of Data Mining Classification

• Databases to be mined

– Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, WWW, etc.

• Knowledge to be mined

– Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

– Multiple/integrated functions and mining at multiple levels

• Techniques utilized

– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

• Applications adapted

– Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


Data Mining and Business Intelligence

[Figure: layered diagram with increasing potential to support business decisions toward the top]

Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA

Layers, top to bottom:

Making Decisions

Data Presentation

Visualization Techniques

Data Mining: Information Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data Sources: Paper, Files, Information Providers, Database Systems, OLTP


Technologies Related to Knowledge Discovery in Databases


Architecture of a Typical Data Mining System

Data Warehouse

Data cleaning & data integration Filtering

Databases

Database or data warehouse server

Data mining engine

Pattern evaluation

Graphical user interface

Knowledge-base


Basic Components and Conceptual Architecture of Data Mining


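Applications of Data Mining in Customer Relationship Management
• For retailers
– Understand customers' spending characteristics, discover their purchasing patterns, and strengthen customer relationships in order to retain customers
• For banks
– Understand the abuses that issuing credit cards may give rise to, and identify the most profitable and most loyal customers
• For insurers
– Analyze the patterns of policyholders' claims and strengthen auditing to prevent fraud
• Advantages
– Effectively increase company revenue at different levels and achieve business goals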


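Applications of Data Mining in Internet Marketing
• Analyze customers' behavior patterns on the website
– When customers visit a website, they often provide a great deal of valuable data, such as personal information, the page content they click, the time they spend on pages, the keywords they use in search engines, and the times at which they visit. Companies can analyze this information to understand customers' behavior patterns and thereby raise customer satisfaction with the company's products and services.
• Application example
– Visitors can be characterized by the following traits
  • Geographic segmentation: visitor address, income, purchasing power
  • Personality traits: the visitor's purchasing style, such as impulsive or carefully calculating
  • Visitor's computing equipment: network bandwidth, operating system, browser, or server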


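Applications of Data Mining in Network Intrusion Analysis
• Discover abnormal network behavior
– Traditional analysis of sudden network incidents takes a long time
– Use high-speed computation to analyze abnormal network behavior and to dynamically adjust and update defense mechanisms
• Application example
– Help network administrators carry out advanced network control and dynamically adjust and update defense mechanisms, thereby curbing the potential threat of network intrusion attacks
– Help network administrators build models of normal and abnormal network behavior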


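Applications of Data Mining in E-learning
• Adaptive E-learning
– Provide suitable learning paths for learners with different backgrounds
– Build a concept map to plan students' learning paths
– Analyze grades to understand the associations among test items and derive the corresponding concepts
• Application example
– Use association rule mining
– Analyze learners' grades and understand the associations among test items
– Derive the associations among the concepts that correspond to the test items
– Find rules that can help domain experts build the learning concept map
– Build an appropriate course concept map.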


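Do Not Underestimate Data Mining
• Popular application areas of data mining
1. Biotechnology industry and DNA data analysis
2. Financial data analysis
3. Retail data analysis
4. Telecommunications industry
• Data stream mining
• Privacy-preserving mining
• Distributed data mining
• Mining of sequence data, multimedia, Web data
• Biological and biomedical data analysis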


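Do Not Overestimate Data Mining
• Data mining is not a panacea
• Successful data mining requires domain knowledge and experience
• Applying data mining requires experts of many kinds
• Discussion question
– Consider a data mining project at a bank
– The goal is to mine which kinds of people are likely to have poor credit
– Question: which kinds of experts might be needed?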


Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by the contributing disciplines]
• Database Technology
• Statistics
• Machine Learning
• Visualization
• Information Science
• Other Disciplines


How to Become a Data Mining Expert
• Data mining concepts and techniques
• Experience from continual practice
• Domain knowledge
