A Statistical Recipe for Data Mining (Essential Statistical Tools for Data Mining)
Yuh-Jye Lee (李育杰)
Dept. of Computer Science & Information Engineering, National Taiwan University of Science and Technology
Institute of Statistical Science Academia Sinica
June 28, 2005
Statistics Science Camp
We Are Overwhelmed with Data
Data collecting and storing are no longer expensive or difficult tasks.
An Earth-orbiting satellite transmits terabytes of data every day.
The New York Stock Exchange processes, on average, 333 million transactions per day.
The gap between the generation of data and our understanding of it is growing explosively.
What is Data Mining?
The process of extracting hidden and useful information, or discovering knowledge, from massive datasets. The process has to operate under acceptable computational efficiency limitations.
(Get the patterns that you can't see.)
Computers make it possible.
One of the 10 emerging technologies that will change the world (MIT Technology Review, Feb. 08, 2001).
Data Mining Applications
Customer relationship management (CRM)
Customer behavior analysis; market basket analysis; customer loyalty
Disease diagnosis and prognosis
Drug discovery
Bioinformatics: microarray gene expression data analysis
Sports: NBA coaches' latest weapon (Toronto Raptors)
(Why does Amazon know what you need?)
Fundamental Problems in Data Mining
Classification problems (supervised learning)
Test the classifier on fresh data to evaluate its success
Decision trees, neural networks, support vector machines, k-nearest neighbors, Naive Bayes, etc.
Classification results are objective
Linear & nonlinear regression
Decision tree regression, model tree regression, ε-insensitive regression
Fundamental Problems in Data Mining
Clustering (unsupervised learning)
Clustering results are subjective
k-means algorithm and k-medians algorithm
Association rules
Minimum support and confidence are required
Feature selection & dimension reduction
Too many features can degrade generalization performance (curse of dimensionality)
Occam's razor: the simplest is the best
Binary Classification Problem
Learn a Classifier from the Training Set
Given a training dataset
S = {(xᵢ, yᵢ) | xᵢ ∈ ℝⁿ, yᵢ ∈ {−1, 1}, i = 1, …, m}
where xᵢ ∈ A⁺ ⇔ yᵢ = 1 and xᵢ ∈ A⁻ ⇔ yᵢ = −1.
Main goal: predict the unseen class label for new data.
(I) Find a function f : ℝⁿ → ℝ by learning from the data, such that
f(x) > 0 ⇒ x ∈ A⁺ and f(x) < 0 ⇒ x ∈ A⁻
(II) Estimate the posterior probability of the label:
Pr(y = 1 | x) > Pr(y = −1 | x) ⇒ x ∈ A⁺
Binary Classification Problem: Linearly Separable Case
[Figure: two classes A⁺ (e.g., Malignant) and A⁻ (e.g., Benign) separated by the plane x′w + b = 0, with bounding planes x′w + b = +1 and x′w + b = −1]
Support Vector Machines for Classification
Maximizing the Margin between Bounding Planes
[Figure: classes A⁺ and A⁻ with bounding planes x′w + b = 1 and x′w + b = −1; the margin is the distance between the two planes]
Summary of the Notations
Let S = {(x₁, y₁), (x₂, y₂), …, (xₘ, yₘ)} be a training dataset, represented by the matrices
A = [x₁′; x₂′; …; xₘ′] ∈ ℝ^(m×n),  D = diag(y₁, …, yₘ) ∈ ℝ^(m×m),
e = [1, 1, …, 1]′ ∈ ℝᵐ.
The system D(Aw + eb) ≥ e, i.e.,
Aᵢw + b ≥ +1, for Dᵢᵢ = +1
Aᵢw + b ≤ −1, for Dᵢᵢ = −1
is relaxed to
D(Aw + eb) + ξ ≥ e, ξ ≥ 0
where ξ is a nonnegative slack (error) vector.
The term e′ξ, the 1-norm measure of the error vector, is called the training error.
min over (w, b, ξ) of e′ξ
s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0 (LP)
Robust Linear Programming: a Preliminary Approach to SVM
For the linearly separable case, at the solution of (LP): ξ = 0.
[Figure: data points marked x (one class) and o (the other), with slack values ξᵢ, ξⱼ measuring how far misclassified points lie beyond their bounding planes]
Support Vector Machine Formulations
(Two Different Measures of Training Error)
2-norm soft margin:
min over (w, b, ξ) ∈ ℝ^(n+1+m) of (1/2)‖w‖₂² + (C/2)‖ξ‖₂²
s.t. D(Aw + eb) + ξ ≥ e
1-norm soft margin:
min over (w, b, ξ) ∈ ℝ^(n+1+m) of (1/2)‖w‖₂² + C e′ξ
s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
The margin is maximized by minimizing the reciprocal of the margin, ‖w‖₂/2.
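The 1-norm soft-margin problem can be attacked by many solvers; one minimal sketch is plain subgradient descent on the equivalent unconstrained hinge-loss objective min over (w, b) of (1/2)‖w‖₂² + C Σᵢ max(0, 1 − yᵢ(w·xᵢ + b)). The toy dataset, step-size schedule, and epoch count below are illustrative assumptions, not from the slides:

```python
# Subgradient descent on the unconstrained 1-norm soft-margin SVM:
#   min_(w,b)  (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
# Dataset and learning-rate schedule are illustrative choices.

def svm_subgradient(X, y, C=1.0, epochs=200, lr0=0.1):
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for t in range(epochs):
        lr = lr0 / (1.0 + t)          # decaying step size
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # margin violated: hinge term is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:                     # only the regularizer contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Linearly separable toy set: class +1 upper-right, class -1 lower-left.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = svm_subgradient(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
```

On this separable toy set the learned plane classifies every training point correctly; a practical implementation would use a QP solver or a dedicated SVM package instead.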
Tuning Procedure: How to Determine C?
Try a range of values of C; too large a value of C leads to overfitting.
The final value of the parameter C is the one with the maximum testing set correctness!
Support Vector Regression: Two-spiral Dataset (94 White Dots & 94 Red Dots)
Noise: mean = 0, σ = 0.4
[Figure: original function vs. estimated function]
Kernel matrix: K(A, A′) ∈ ℝ^(28900×300)
Parameters: C = 10000, μ = 1, ε = 0.2
Training time: 22.58 sec.
MAE at 49×49 mesh points: 0.0513
Naïve Bayes for Classification Problems
Good for binary as well as multi-category problems.
Let each attribute be a random variable. What is the probability of the class given an instance?
Pr(Y = y | X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ) = ?
Naïve Bayes assumptions: the importance of each attribute is equal, and all attributes are independent!
Pr(Y = y | X₁ = x₁, …, Xₙ = xₙ) = [Pr(Y = y) / Pr(X = x)] ∏ⱼ₌₁ⁿ Pr(Xⱼ = xⱼ | Y = y)
The Weather Data Example (Ian H. Witten & Eibe Frank, Data Mining)

Outlook   Temperature  Humidity  Windy  Play (Label)
Sunny     Hot          High      False  -1
Sunny     Hot          High      True   -1
Overcast  Hot          High      False  +1
Rainy     Mild         High      False  +1
Rainy     Cool         Normal    False  +1
Rainy     Cool         Normal    True   -1
Overcast  Cool         Normal    True   +1
Sunny     Mild         High      False  -1
Sunny     Cool         Normal    False  +1
Rainy     Mild         Normal    False  +1
Sunny     Mild         Normal    True   +1
Overcast  Mild         High      True   +1
Overcast  Hot          Normal    False  +1
Rainy     Mild         High      True   -1
Probabilities for the Weather Data
Using Frequencies to Approximate Probabilities

            Play = Yes  Play = No
Outlook
  Sunny        2/9         3/5
  Overcast     4/9         0/5
  Rainy        3/9         2/5
Temp.
  Hot          2/9         2/5
  Mild         4/9         2/5
  Cool         3/9         1/5
Humidity
  High         3/9         4/5
  Normal       6/9         1/5
Windy
  True         3/9         3/5
  False        6/9         2/5
Play
  Yes          9/14
  No           5/14

Each factor, e.g. Pr(X₁ = rainy | Y = 1) or Pr(Y = 1), is estimated by a frequency from the table.
Likelihood of the two classes for the new instance (sunny, cool, high, windy = True):
Pr(Y = +1 | sunny, cool, high, T) ∝ 2/9 · 3/9 · 3/9 · 3/9 · 9/14
Pr(Y = −1 | sunny, cool, high, T) ∝ 3/5 · 1/5 · 4/5 · 3/5 · 5/14
Note the 0/5 entry for Overcast given Play = No: what happens if a test instance has Outlook = Overcast? ???
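The two likelihoods above can be checked numerically. A small sketch that multiplies the frequency estimates read directly off the table (exact rational arithmetic via the standard library):

```python
# Naive Bayes scores for the instance (sunny, cool, high, windy=True),
# using the frequency estimates from the weather-data table.
from fractions import Fraction as F

# Pr(attribute value | class) for sunny, cool, high, windy=True, plus priors.
yes = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
no  = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]
prior_yes, prior_no = F(9, 14), F(5, 14)

score_yes = prior_yes
for p in yes:
    score_yes *= p
score_no = prior_no
for p in no:
    score_no *= p

# Normalize the two scores to obtain posterior probabilities.
post_yes = score_yes / (score_yes + score_no)
post_no = 1 - post_yes
```

With these numbers score_no > score_yes, so the classifier predicts Play = No for this instance.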
The Zero-frequency Problem
Not observed does not mean it will never occur!
Q: A die is rolled 8 times, giving 2, 5, 6, 2, 1, 5, 3, 6. What is the probability of rolling a 4?
What if an attribute value does NOT occur with a class value?
The posterior probability will be zero, no matter how likely the other attribute values are!
P(X = 4) = (0 + μ) / (8 + 6μ);  P(X = 5) = (2 + μ) / (8 + 6μ)
The Laplace estimator, (k + aμ) / (n + μ), where a is the prior probability of the value, will fix the zero-frequency problem.
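The die example above works out as follows: each of the 6 faces receives μ pseudo-counts, so an unseen face gets a small but nonzero probability. A short sketch:

```python
# Laplace estimator for the die example: 8 rolls = 2, 5, 6, 2, 1, 5, 3, 6.
# Each of the 6 faces receives mu pseudo-counts: (k + mu) / (n + 6*mu).
from collections import Counter

rolls = [2, 5, 6, 2, 1, 5, 3, 6]
counts = Counter(rolls)

def laplace(face, mu=1.0):
    return (counts[face] + mu) / (len(rolls) + 6 * mu)

p4 = laplace(4)   # unseen face: no longer zero
p5 = laplace(5)   # observed twice
```

With μ = 1 this gives P(X = 4) = 1/14 and P(X = 5) = 3/14, and the six probabilities still sum to 1.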
How to Evaluate What's Been Learned? (When Cost is NOT Sensitive)
Measure the performance of a classifier in terms of error rate or accuracy.
Main goal: predict the unseen class label for new data.
We have to assess a classifier's error rate on a set that plays no role in the learning process.
Split the data instances in hand into two parts:
Training set: for learning the classifier
Testing set: for evaluating the classifier
Error rate = (number of misclassified points) / (total number of data points)
k-fold (Stratified) Cross-Validation
Maximize the usage of the data in hand.
Split the data into k approximately equal partitions.
Each partition in turn is used for testing while the remainder is used for training.
The labels (+/−) in the training and testing sets should be in about the right proportion.
Doing the random splitting in the positive class and the negative class separately will guarantee this; this procedure is called stratification.
Leave-one-out cross-validation: k = number of data points. No random sampling is involved, but it is nonstratified.
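The stratification step described above can be sketched in a few lines: shuffle each class's indices separately and deal them into the folds. The round-robin dealing is an illustrative choice, not the slides' exact procedure:

```python
# Minimal stratified k-fold split: partition the positive and negative
# index sets separately so every fold keeps the class proportions.
import random

def stratified_kfold(labels, k, seed=0):
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)   # deal indices round-robin into folds
    return folds

labels = [+1] * 10 + [-1] * 10
folds = stratified_kfold(labels, k=5)
```

With 10 positives and 10 negatives, each of the 5 folds ends up with exactly 2 of each class.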
How to Compare Two Classifiers? Testing Hypothesis: Paired t-test
We compare two learning algorithms by comparing the average error rate over several cross-validations. Assume the same cross-validation split can be used for both methods.
H₀: d̄ = 0  vs.  H₁: d̄ ≠ 0
The t-statistic:
t = d̄ / √(σ_d² / k)
where dᵢ = xᵢ − yᵢ and d̄ = (1/k) Σᵢ₌₁ᵏ dᵢ.
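The t-statistic above is straightforward to compute; a small sketch, where the per-fold scores are made-up numbers for illustration:

```python
# Paired t-test statistic: t = dbar / sqrt(var_d / k), using the sample
# variance (k-1 denominator) of the per-fold differences d_i = x_i - y_i.
import math

def paired_t(xs, ys):
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    dbar = sum(d) / k
    var_d = sum((di - dbar) ** 2 for di in d) / (k - 1)
    return dbar / math.sqrt(var_d / k)

# Hypothetical 4-fold accuracies for two classifiers.
t_stat = paired_t([0.9, 0.8, 0.85, 0.75], [0.8, 0.6, 0.7, 0.7])
```

The resulting t value would then be compared against the t-distribution with k − 1 degrees of freedom.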
How to Evaluate What's Been Learned? (When Cost IS Sensitive)
Two types of error can occur: false positives (FP) & false negatives (FN).
For a binary classification problem, the results can be summarized in a 2×2 confusion matrix:

                   Predicted Positive   Predicted Negative
Actual Positive    True Pos. (TP)       False Neg. (FN)
Actual Negative    False Pos. (FP)      True Neg. (TN)
ROC Curve (Receiver Operating Characteristic Curve)
An evaluation method for learning models.
What it concerns is the ranking of instances made by the learning model.
A ranking means that we sort the instances w.r.t. the probability of being a positive instance, from high to low.
The ROC curve plots the true positive rate (TPr) as a function of the false positive rate (FPr).
An Example of ROC Curve

Inst ID  Class  Score        Inst ID  Class  Score (sorted)
1        P      0.51         7        P      0.9
2        P      0.8          2        P      0.8
3        P      0.3          15       N      0.7
4        P      0.55         10       P      0.6
5        P      0.4          4        P      0.55
6        P      0.34         8        P      0.54
7        P      0.9          18       N      0.53
8        P      0.54         12       N      0.52
9        P      0.38         1        P      0.51
10       P      0.6          19       N      0.5
11       N      0.35         5        P      0.4
12       N      0.52         17       N      0.39
13       N      0.36         9        P      0.38
14       N      0.37         14       N      0.37
15       N      0.7          13       N      0.36
16       N      0.1          11       N      0.35
17       N      0.39         6        P      0.34
18       N      0.53         20       N      0.33
19       N      0.5          3        P      0.3
20       N      0.33         16       N      0.1
Using ROC to Compare Two Methods
Under the same FP rate, method A is better than method B.
[Figure: ROC curves of methods A and B; curve A lies above curve B everywhere]
What if There is a Tie?
[Figure: ROC curves of methods A and B that cross each other]
Which one is better?
Area under the Curve (AUC)
An index of the ROC curve with range from 0 to 1.
An AUC value of 1 corresponds to a perfect ranking (all positive instances are ranked higher than all negative instances).
A simple formula for calculating AUC:
AUC = (1 / mn) Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ I{f(xᵢ) > f(xⱼ)}
where m is the number of positive instances and n is the number of negative instances.
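Applying the pairwise formula to the 20-instance example above takes only a few lines; the scores below are copied from the slide's table:

```python
# AUC via the pairwise formula: the fraction of (positive, negative) pairs
# that the scores rank correctly. Scores come from the 20-instance example.
pos = [0.51, 0.8, 0.3, 0.55, 0.4, 0.34, 0.9, 0.54, 0.38, 0.6]   # class P
neg = [0.35, 0.52, 0.36, 0.37, 0.7, 0.1, 0.39, 0.53, 0.5, 0.33]  # class N

def auc(pos_scores, neg_scores):
    m, n = len(pos_scores), len(neg_scores)
    wins = sum(1 for p in pos_scores for q in neg_scores if p > q)
    return wins / (m * n)

value = auc(pos, neg)   # 0.68 for this example (68 of 100 pairs correct)
```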
Performance Measures in Information Retrieval (IR)
An IR system, such as Google, given a query (keyword search), will try to retrieve all relevant documents in a corpus.
Documents returned that are NOT relevant: FP
The relevant documents that are NOT returned: FN
Performance measures in IR, recall & precision:
Precision = TP / (TP + FP)  and  Recall = TP / (TP + FN)
Balance the Tradeoff between Recall and Precision: F-measure
Two extreme cases:
Return only one document, with 100% confidence; then precision = 1 but recall will be very small.
Return all documents in the corpus; then recall = 1 but precision will be very small.
The F-measure balances this tradeoff:
F-measure = 2 / (1/Recall + 1/Precision)
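The three measures are simple functions of the confusion-matrix counts; the counts in the example call below are hypothetical:

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 / (1 / recall + 1 / precision)   # harmonic mean of the two
    return precision, recall, f

# Hypothetical retrieval run: 40 relevant docs returned, 10 irrelevant
# docs returned, 20 relevant docs missed.
p, r, f = prf(tp=40, fp=10, fn=20)
```

The harmonic mean penalizes a large gap between the two: either extreme case above drives the F-measure toward 0 even though one of the two measures equals 1.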
Curse of Dimensionality: Dealing with High Dimensional Datasets
Learning in very high dimensions with very few samples. For example, microarray datasets:
Acute leukemia dataset: 7129 genes vs. 72 samples
Colon cancer dataset: 2000 genes vs. 62 samples
Feature selection will be needed.
In text mining, there are many useless words, called stopwords, such as: is, I, and …
Feature Selection – Filter Model Using a Fisher-like Score Approach
[Figure: class-conditional distributions of three features; the separations (μ₁⁺ − μ₁⁻), (μ₂⁺ − μ₂⁻), (μ₃⁺ − μ₃⁻) between the positive- and negative-class means differ across features]
Weight Score Approach
Weight score:
wⱼ = (μⱼ⁺ − μⱼ⁻) / (σⱼ⁺ + σⱼ⁻)
where μⱼ and σⱼ are the mean and standard deviation of the j-th feature for training examples of the positive or negative class.
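The weight score is a per-feature computation over the two class-conditional samples. A sketch on a made-up 2-feature toy set, where feature 0 separates the classes and feature 1 does not:

```python
# Weight score w_j = (mu_j^+ - mu_j^-) / (sigma_j^+ + sigma_j^-) per feature.
# The 2-feature toy data are illustrative.
from statistics import mean, stdev

def weight_scores(X_pos, X_neg):
    n = len(X_pos[0])
    scores = []
    for j in range(n):
        col_p = [x[j] for x in X_pos]
        col_n = [x[j] for x in X_neg]
        scores.append((mean(col_p) - mean(col_n)) / (stdev(col_p) + stdev(col_n)))
    return scores

# Feature 0 separates the classes well; feature 1 overlaps heavily.
X_pos = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5)]
X_neg = [(1.0, 1.2), (2.0, 1.8), (1.5, 1.4)]
scores = weight_scores(X_pos, X_neg)
```

Ranking features by the magnitude of wⱼ then keeps the most discriminative ones.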
The χ² Test
Test whether the class label and a single attribute are "significantly" correlated with each other.
Two-class classification vs. binary attribute: use a contingency matrix to summarize the data.

             Attribute = 0   Attribute = 1
Class = 0    k₀₀             k₀₁
Class = 1    k₁₀             k₁₁

χ² = n (k₁₁k₀₀ − k₁₀k₀₁)² / [(k₁₁ + k₁₀)(k₀₀ + k₀₁)(k₁₁ + k₀₁)(k₀₀ + k₁₀)]

The χ² measure aggregates the deviation of observed values from expected values (under the independence hypothesis).
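The 2×2 formula above is a one-liner in code; the counts in the example call are hypothetical:

```python
# Chi-square statistic for a 2x2 class-vs-attribute contingency table,
# using the closed-form 2x2 formula from the slide.
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k00 + k01) * (k11 + k01) * (k00 + k10)
    return num / den

# Hypothetical counts: attribute value 1 co-occurs mostly with class 1.
stat = chi_square(k00=30, k01=10, k10=5, k11=25)
```

Note the sanity check: when the table is perfectly balanced (class and attribute independent), the cross-product difference k₁₁k₀₀ − k₁₀k₀₁ vanishes and the statistic is 0.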
Mutual Information (MI)
Let X and Y be discrete random variables. The mutual information between them is defined as
MI(X; Y) = Σₓ Σᵧ Pr(x, y) log [Pr(x, y) / (Pr(x) Pr(y))]
MI(X; Y) = 0 ⟺ X and Y are independent.
The more positive MI is, the more correlated X and Y are.
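The definition translates directly into code, given the joint distribution as a table; the two 2×2 joints below are standard sanity checks, not data from the slides:

```python
# Mutual information of two discrete variables from their joint table:
#   MI(X;Y) = sum_xy p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats.
import math

def mutual_information(joint):
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                         # 0 * log 0 is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]      # MI = 0
correlated = [[0.5, 0.0], [0.0, 0.5]]           # MI = log 2
```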
Ranking Features for Classification: Filter Model
The perfect feature selection would consider all possible subsets of features:
For each subset, train and test a classifier.
Retain the subset that results in the highest accuracy. (Computationally infeasible!)
Instead, measure the "discrimination ability" of each feature, for example with the weight score, MI, or the χ² measure.
Rank the features w.r.t. this measure and select the top p features.
Caveat: highly linearly correlated features might all be selected.
"Publication has been extended far beyond our present ability to make real use of the record." V. Bush, As we may think, Atlantic Monthly, 176 (1945), pp. 101-108
Can Computers Read? Text Classification & Web Mining
Introduction to Google
The name "Google" is a play on the mathematical term "googol": a very large number, 1 followed by 100 zeros (a number exceeding the number of atoms in the universe, which is only on the order of 10^85) (Note 1).
That is, it represents 10 to the 100th power (10^100).
(Note 1: Dictionary of Mathematics, 1999, Owl Publishing, p. 311)
The History of Google
Google's predecessor was BackRub; the company was founded in September 1998 by Sergey Brin and Larry Page, Ph.D. students at Stanford University.
Retrieval Architecture
(Source: http://www-db.stanford.edu/~backrub/google.html)
[Figure: retrieval architecture with components: lexicon, classification, URL analysis, page ranking]
How PageRank Works
PageRank acts as an indicator of an individual page's value, relying on the uniquely democratic nature of the web through its vast link structure.
In essence, Google interprets a link from page A to page B as a vote cast by page A for page B. Of course, Google looks at where the votes come from, i.e., the number of votes the linking pages themselves receive, and analyzes the pages that cast them. Votes cast by "important" pages help make the pages they link to "important" as well.
Important, high-quality sites receive a higher PageRank, which Google remembers each time it processes a query. Of course, important pages mean nothing to you if they do not match your query.
So Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your query. Google examines the words that appear on a page, and checks all of the page's content (as well as the content of the pages linking to it), to determine whether the result best matches your needs.
Integrity
Google's complex, automated method makes it very hard for anyone intent on tampering with search results to do so. Although relevant advertisements are placed near the search results, Google does not sell placement within the results themselves (in other words, no one can buy a higher PageRank). Google search is a simple, honest, and objective way to find high-quality websites relevant to your search.
(Summarized from Google's own web page: http://www.google.com.tw/intl/zh-TW/why_use.html)
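The voting idea described above can be sketched as a power iteration: each page splits its rank evenly among the pages it links to, with a damping factor for random jumps. The 3-page link graph and the damping value 0.85 are illustrative assumptions, not from the slides:

```python
# Power-iteration sketch of the PageRank idea: each page splits its rank
# evenly among its outlinks; damping models a random jump to any page.
def pagerank(links, n, damping=0.85, iters=100):
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n       # random-jump contribution
        for src, outs in links.items():
            share = rank[src] / len(outs)     # src's vote, split evenly
            for dst in outs:
                new[dst] += damping * share
        rank = new
    return rank

# Toy web: pages 0 and 1 both link to page 2; page 2 links back to page 0.
links = {0: [2], 1: [2], 2: [0]}
rank = pagerank(links, n=3)
```

Page 2, which receives two inbound "votes", ends up with the highest rank; page 1, with no inbound links, gets only the random-jump baseline.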
Preprocessing for Text Classification: Convert Documents into Data Input
Stopwords such as: a, an, the, be, on … Eliminating stopwords reduces space and improves performance.
Polysemy (one word, several meanings): "can" is both a verb and a noun.
"to be or not to be"
Stemming (or conflation) using Porter's algorithm: "university" and "universal" both map to "univers".
Stemming increases the number of documents in the response, but also the number of irrelevant documents.
Reuters-21578
21578 docs, 27000 terms, and 135 classes
21578 documents: 1-14818 belong to the training set; 14819-21578 belong to the testing set.
Reuters-21578 includes 135 categories; using the ApteMod version of the TOPICS set results in 90 categories, with 7,770 training documents and 3,019 testing documents.
Preprocessing Procedures (cont.)
[Figure: term statistics after stopword elimination and after the Porter algorithm]
Binary Text Classification: earn(+) vs. acq(−)
Select the top 500 terms using mutual information.
Evaluate each classifier using the F-measure.
Compare the two classifiers using a 10-fold paired t-test.

10-fold Testing Results: RSVM vs. Naïve Bayes

Fold    1       2       3      4      5      6      7       8      9      10
RSVM    0.965   0.975   0.99   0.984  0.974  0.984  0.936   0.98   0.974  0.974
NB      0.969   0.984   0.969  0.974  0.941  0.964  0.974   0.974  0.953  0.958
d_i    -0.004  -0.009   0.021  0.01   0.033  0.02  -0.038   0.006  0.021  0.016

H₀: there is no difference between RSVM and NB.
t = d̄ / √(σ_d² / k) = 0.0134 / 0.0048 = 2.7917 > t(0.025, 9) = 2.26216
Reject H₀ with 95% confidence level.
Conclusion
Computer Science is in need of Statistics
Thank you!