A Statistical Recipe for Data Mining (Essential Statistical Tools for Data Mining)
Yuh-Jye Lee (李育杰)
Dept. of Computer Science & Information Engineering, National Taiwan University of Science and Technology
Institute of Statistical Science Academia Sinica
June 28, 2005
Statistics Science Camp
We Are Overwhelmed with Data
Data collecting and storing are no longer expensive or difficult tasks.
An Earth-orbiting satellite transmits terabytes of data every day.
The New York Stock Exchange processes, on average, 333 million transactions per day.
The gap between the generation of data and our understanding of it is growing explosively.
What is Data Mining?
The process of extracting hidden and useful information, or discovering knowledge, from massive datasets. The process has to operate under acceptable computational efficiency limitations.
(Get the patterns that you can't see.)
Computers make it possible.
One of the 10 emerging technologies that will change the world (MIT Technology Review, Feb. 08, 2001).
Data Mining Applications
Customer relationship management (CRM)
Customer behavior analysis; market basket analysis; customer loyalty
Disease diagnosis and prognosis
Drug discovery
Bioinformatics: microarray gene expression data analysis
Sports: NBA coaches' latest weapon (Toronto Raptors)
(Why does Amazon know what you need?)
Fundamental Problems in Data Mining
Classification problems (supervised learning)
Test the classifier on fresh data to evaluate its success
Decision trees, neural networks, support vector machines, k-nearest neighbors, Naive Bayes, etc.
Classification results are objective
Linear & nonlinear regression
Decision tree regression, model tree regression, ε-insensitive regression
Fundamental Problems in Data Mining
Clustering (unsupervised learning)
Clustering results are subjective
k-means algorithm and k-medians algorithm
Association rules
Minimum support and confidence are required
Feature selection & dimension reduction
Too many features can degrade generalization performance (curse of dimensionality)
Occam's razor: the simplest is the best
Binary Classification Problem
Learn a Classifier from the Training Set
Given a training dataset
S = {(xᵢ, yᵢ) | xᵢ ∈ ℝⁿ, yᵢ ∈ {−1, 1}, i = 1, …, m}
where xᵢ ∈ A⁺ ⇔ yᵢ = 1 and xᵢ ∈ A⁻ ⇔ yᵢ = −1.
Main goal: predict the unseen class label for new data.
(I) Find a function f : ℝⁿ → ℝ by learning from the data, such that
f(x) > 0 ⇒ x ∈ A⁺ and f(x) < 0 ⇒ x ∈ A⁻
(II) Estimate the posterior probability of the label:
Pr(y = 1 | x) > Pr(y = −1 | x) ⇒ x ∈ A⁺
Binary Classification Problem: Linearly Separable Case
[Figure: two classes A⁺ (e.g., Malignant) and A⁻ (e.g., Benign) separated by the plane x′w + b = 0, with bounding planes x′w + b = +1 and x′w + b = −1]
Support Vector Machines for Classification
Maximizing the Margin between Bounding Planes
[Figure: classes A⁺ and A⁻ with bounding planes x′w + b = 1 and x′w + b = −1; the margin is the distance between the two planes]
Summary of the Notations
Let S = {(x₁, y₁), (x₂, y₂), …, (xₘ, yₘ)} be a training dataset, represented by the matrices
A = [x₁′; x₂′; …; xₘ′] ∈ ℝ^(m×n),  D = diag(y₁, …, yₘ) ∈ ℝ^(m×m),
e = [1, 1, …, 1]′ ∈ ℝᵐ.
The system D(Aw + eb) ≥ e, i.e.,
Aᵢw + b ≥ +1, for Dᵢᵢ = +1
Aᵢw + b ≤ −1, for Dᵢᵢ = −1
is relaxed to
D(Aw + eb) + ξ ≥ e, ξ ≥ 0
where ξ is a nonnegative slack (error) vector.
The term e′ξ, the 1-norm measure of the error vector, is called the training error.
min over (w, b, ξ) of e′ξ
s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0 (LP)
Robust Linear Programming: a Preliminary Approach to SVM
For the linearly separable case, at the solution of (LP): ξ = 0.
[Figure: data points marked x (one class) and o (the other), with slack values ξᵢ, ξⱼ measuring how far misclassified points lie beyond their bounding planes]
Support Vector Machine Formulations
(Two Different Measures of Training Error)
2-norm soft margin:
min over (w, b, ξ) ∈ ℝ^(n+1+m) of (1/2)‖w‖₂² + (C/2)‖ξ‖₂²
s.t. D(Aw + eb) + ξ ≥ e
1-norm soft margin:
min over (w, b, ξ) ∈ ℝ^(n+1+m) of (1/2)‖w‖₂² + C e′ξ
s.t. D(Aw + eb) + ξ ≥ e, ξ ≥ 0
The margin is maximized by minimizing the reciprocal of the margin, ‖w‖₂/2.
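The 1-norm soft-margin problem can be attacked by many solvers; one minimal sketch is plain subgradient descent on the equivalent unconstrained hinge-loss objective min over (w, b) of (1/2)‖w‖₂² + C Σᵢ max(0, 1 − yᵢ(w·xᵢ + b)). The toy dataset, step-size schedule, and epoch count below are illustrative assumptions, not from the slides:

```python
# Subgradient descent on the unconstrained 1-norm soft-margin SVM:
#   min_(w,b)  (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
# Dataset and learning-rate schedule are illustrative choices.

def svm_subgradient(X, y, C=1.0, epochs=200, lr0=0.1):
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for t in range(epochs):
        lr = lr0 / (1.0 + t)          # decaying step size
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # margin violated: hinge term is active
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b += lr * C * yi
            else:                     # only the regularizer contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Linearly separable toy set: class +1 upper-right, class -1 lower-left.
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = svm_subgradient(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
```

On this separable toy set the learned plane classifies every training point correctly; a practical implementation would use a QP solver or a dedicated SVM package instead.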
Tuning Procedure: How to Determine C?
Try a range of values of C; too large a value of C leads to overfitting.
The final value of the parameter C is the one with the maximum testing set correctness!
Support Vector Regression: Two-spiral Dataset (94 White Dots & 94 Red Dots)
Noise: mean = 0, σ = 0.4
[Figure: original function vs. estimated function]
Kernel matrix: K(A, A′) ∈ ℝ^(28900×300)
Parameters: C = 10000, μ = 1, ε = 0.2
Training time: 22.58 sec.
MAE at 49×49 mesh points: 0.0513
Naïve Bayes for Classification Problems
Good for binary as well as multi-category problems.
Let each attribute be a random variable. What is the probability of the class given an instance?
Pr(Y = y | X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ) = ?
Naïve Bayes assumptions: the importance of each attribute is equal, and all attributes are independent!
Pr(Y = y | X₁ = x₁, …, Xₙ = xₙ) = [Pr(Y = y) / Pr(X = x)] ∏ⱼ₌₁ⁿ Pr(Xⱼ = xⱼ | Y = y)
The Weather Data Example (Ian H. Witten & Eibe Frank, Data Mining)

Outlook   Temperature  Humidity  Windy  Play (Label)
Sunny     Hot          High      False  -1
Sunny     Hot          High      True   -1
Overcast  Hot          High      False  +1
Rainy     Mild         High      False  +1
Rainy     Cool         Normal    False  +1
Rainy     Cool         Normal    True   -1
Overcast  Cool         Normal    True   +1
Sunny     Mild         High      False  -1
Sunny     Cool         Normal    False  +1
Rainy     Mild         Normal    False  +1
Sunny     Mild         Normal    True   +1
Overcast  Mild         High      True   +1
Overcast  Hot          Normal    False  +1
Rainy     Mild         High      True   -1
Probabilities for the Weather Data
Using Frequencies to Approximate Probabilities

            Play = Yes  Play = No
Outlook
  Sunny        2/9         3/5
  Overcast     4/9         0/5
  Rainy        3/9         2/5
Temp.
  Hot          2/9         2/5
  Mild         4/9         2/5
  Cool         3/9         1/5
Humidity
  High         3/9         4/5
  Normal       6/9         1/5
Windy
  True         3/9         3/5
  False        6/9         2/5
Play
  Yes          9/14
  No           5/14

Each factor, e.g. Pr(X₁ = rainy | Y = 1) or Pr(Y = 1), is estimated by a frequency from the table.
Likelihood of the two classes for the new instance (sunny, cool, high, windy = True):
Pr(Y = +1 | sunny, cool, high, T) ∝ 2/9 · 3/9 · 3/9 · 3/9 · 9/14
Pr(Y = −1 | sunny, cool, high, T) ∝ 3/5 · 1/5 · 4/5 · 3/5 · 5/14
Note the 0/5 entry for Overcast given Play = No: what happens if a test instance has Outlook = Overcast? ???
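The two likelihoods above can be checked numerically. A small sketch that multiplies the frequency estimates read directly off the table (exact rational arithmetic via the standard library):

```python
# Naive Bayes scores for the instance (sunny, cool, high, windy=True),
# using the frequency estimates from the weather-data table.
from fractions import Fraction as F

# Pr(attribute value | class) for sunny, cool, high, windy=True, plus priors.
yes = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]
no  = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]
prior_yes, prior_no = F(9, 14), F(5, 14)

score_yes = prior_yes
for p in yes:
    score_yes *= p
score_no = prior_no
for p in no:
    score_no *= p

# Normalize the two scores to obtain posterior probabilities.
post_yes = score_yes / (score_yes + score_no)
post_no = 1 - post_yes
```

With these numbers score_no > score_yes, so the classifier predicts Play = No for this instance.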
The Zero-frequency Problem
Not observed does not mean it will never occur!
Q: A die is rolled 8 times, giving 2, 5, 6, 2, 1, 5, 3, 6. What is the probability of rolling a 4?
What if an attribute value does NOT occur with a class value?
The posterior probability will be zero, no matter how likely the other attribute values are!
P(X = 4) = (0 + μ) / (8 + 6μ);  P(X = 5) = (2 + μ) / (8 + 6μ)
The Laplace estimator, (k + aμ) / (n + μ), where a is the prior probability of the value, will fix the zero-frequency problem.
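The die example above works out as follows: each of the 6 faces receives μ pseudo-counts, so an unseen face gets a small but nonzero probability. A short sketch:

```python
# Laplace estimator for the die example: 8 rolls = 2, 5, 6, 2, 1, 5, 3, 6.
# Each of the 6 faces receives mu pseudo-counts: (k + mu) / (n + 6*mu).
from collections import Counter

rolls = [2, 5, 6, 2, 1, 5, 3, 6]
counts = Counter(rolls)

def laplace(face, mu=1.0):
    return (counts[face] + mu) / (len(rolls) + 6 * mu)

p4 = laplace(4)   # unseen face: no longer zero
p5 = laplace(5)   # observed twice
```

With μ = 1 this gives P(X = 4) = 1/14 and P(X = 5) = 3/14, and the six probabilities still sum to 1.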
How to Evaluate What's Been Learned? (When Cost is NOT Sensitive)
Measure the performance of a classifier in terms of error rate or accuracy.
Main goal: predict the unseen class label for new data.
We have to assess a classifier's error rate on a set that plays no role in the learning process.
Split the data instances in hand into two parts:
Training set: for learning the classifier
Testing set: for evaluating the classifier
Error rate = (number of misclassified points) / (total number of data points)
k-fold (Stratified) Cross-Validation
Maximize the usage of the data in hand.
Split the data into k approximately equal partitions.
Each partition in turn is used for testing while the remainder is used for training.
The labels (+/−) in the training and testing sets should be in about the right proportion.
Doing the random splitting in the positive class and the negative class separately will guarantee this; this procedure is called stratification.
Leave-one-out cross-validation: k = number of data points. No random sampling is involved, but it is nonstratified.
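The stratification step described above can be sketched in a few lines: shuffle each class's indices separately and deal them into the folds. The round-robin dealing is an illustrative choice, not the slides' exact procedure:

```python
# Minimal stratified k-fold split: partition the positive and negative
# index sets separately so every fold keeps the class proportions.
import random

def stratified_kfold(labels, k, seed=0):
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)   # deal indices round-robin into folds
    return folds

labels = [+1] * 10 + [-1] * 10
folds = stratified_kfold(labels, k=5)
```

With 10 positives and 10 negatives, each of the 5 folds ends up with exactly 2 of each class.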
How to Compare Two Classifiers? Testing Hypothesis: Paired t-test
We compare two learning algorithms by comparing the average error rate over several cross-validations. Assume the same cross-validation split can be used for both methods.
H₀: d̄ = 0  vs.  H₁: d̄ ≠ 0
The t-statistic:
t = d̄ / √(σ_d² / k)
where dᵢ = xᵢ − yᵢ and d̄ = (1/k) Σᵢ₌₁ᵏ dᵢ.
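The t-statistic above is straightforward to compute; a small sketch, where the per-fold scores are made-up numbers for illustration:

```python
# Paired t-test statistic: t = dbar / sqrt(var_d / k), using the sample
# variance (k-1 denominator) of the per-fold differences d_i = x_i - y_i.
import math

def paired_t(xs, ys):
    k = len(xs)
    d = [x - y for x, y in zip(xs, ys)]
    dbar = sum(d) / k
    var_d = sum((di - dbar) ** 2 for di in d) / (k - 1)
    return dbar / math.sqrt(var_d / k)

# Hypothetical 4-fold accuracies for two classifiers.
t_stat = paired_t([0.9, 0.8, 0.85, 0.75], [0.8, 0.6, 0.7, 0.7])
```

The resulting t value would then be compared against the t-distribution with k − 1 degrees of freedom.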
How to Evaluate What's Been Learned? (When Cost IS Sensitive)
Two types of error can occur: false positives (FP) & false negatives (FN).
For a binary classification problem, the results can be summarized in a 2×2 confusion matrix:

                   Predicted Positive   Predicted Negative
Actual Positive    True Pos. (TP)       False Neg. (FN)
Actual Negative    False Pos. (FP)      True Neg. (TN)
ROC Curve (Receiver Operating Characteristic Curve)
An evaluation method for learning models.
What it concerns is the ranking of instances made by the learning model.
A ranking means that we sort the instances w.r.t. the probability of being a positive instance, from high to low.
The ROC curve plots the true positive rate (TPr) as a function of the false positive rate (FPr).
An Example of ROC Curve

Inst ID  Class  Score        Inst ID  Class  Score (sorted)
1        P      0.51         7        P      0.9
2        P      0.8          2        P      0.8
3        P      0.3          15       N      0.7
4        P      0.55         10       P      0.6
5        P      0.4          4        P      0.55
6        P      0.34         8        P      0.54
7        P      0.9          18       N      0.53
8        P      0.54         12       N      0.52
9        P      0.38         1        P      0.51
10       P      0.6          19       N      0.5
11       N      0.35         5        P      0.4
12       N      0.52         17       N      0.39
13       N      0.36         9        P      0.38
14       N      0.37         14       N      0.37
15       N      0.7          13       N      0.36
16       N      0.1          11       N      0.35
17       N      0.39         6        P      0.34
18       N      0.53         20       N      0.33
19       N      0.5          3        P      0.3
20       N      0.33         16       N      0.1
Using ROC to Compare Two Methods
Under the same FP rate, method A is better than method B.
[Figure: ROC curves of methods A and B; curve A lies above curve B everywhere]
What if There is a Tie?
[Figure: ROC curves of methods A and B that cross each other]
Which one is better?
Area under the Curve (AUC)
An index of the ROC curve with range from 0 to 1.
An AUC value of 1 corresponds to a perfect ranking (all positive instances are ranked higher than all negative instances).
A simple formula for calculating AUC:
AUC = (1 / mn) Σᵢ₌₁ᵐ Σⱼ₌₁ⁿ I{f(xᵢ) > f(xⱼ)}
where m is the number of positive instances and n is the number of negative instances.
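Applying the pairwise formula to the 20-instance example above takes only a few lines; the scores below are copied from the slide's table:

```python
# AUC via the pairwise formula: the fraction of (positive, negative) pairs
# that the scores rank correctly. Scores come from the 20-instance example.
pos = [0.51, 0.8, 0.3, 0.55, 0.4, 0.34, 0.9, 0.54, 0.38, 0.6]   # class P
neg = [0.35, 0.52, 0.36, 0.37, 0.7, 0.1, 0.39, 0.53, 0.5, 0.33]  # class N

def auc(pos_scores, neg_scores):
    m, n = len(pos_scores), len(neg_scores)
    wins = sum(1 for p in pos_scores for q in neg_scores if p > q)
    return wins / (m * n)

value = auc(pos, neg)   # 0.68 for this example (68 of 100 pairs correct)
```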
Performance Measures in Information Retrieval (IR)
An IR system, such as Google, given a query (keyword search), will try to retrieve all relevant documents in a corpus.
Documents returned that are NOT relevant: FP
The relevant documents that are NOT returned: FN
Performance measures in IR, recall & precision:
Precision = TP / (TP + FP)  and  Recall = TP / (TP + FN)
Balance the Tradeoff between Recall and Precision: F-measure
Two extreme cases:
Return only one document, with 100% confidence; then precision = 1 but recall will be very small.
Return all documents in the corpus; then recall = 1 but precision will be very small.
The F-measure balances this tradeoff:
F-measure = 2 / (1/Recall + 1/Precision)
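The three measures are simple functions of the confusion-matrix counts; the counts in the example call below are hypothetical:

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 / (1 / recall + 1 / precision)   # harmonic mean of the two
    return precision, recall, f

# Hypothetical retrieval run: 40 relevant docs returned, 10 irrelevant
# docs returned, 20 relevant docs missed.
p, r, f = prf(tp=40, fp=10, fn=20)
```

The harmonic mean penalizes a large gap between the two: either extreme case above drives the F-measure toward 0 even though one of the two measures equals 1.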
Curse of Dimensionality: Dealing with High Dimensional Datasets
Learning in very high dimensions with very few samples. For example, microarray datasets:
Acute leukemia dataset: 7129 genes vs. 72 samples
Colon cancer dataset: 2000 genes vs. 62 samples
Feature selection will be needed.
In text mining, there are many useless words, called stopwords, such as: is, I, and …
Feature Selection – Filter Model Using a Fisher-like Score Approach
[Figure: class-conditional distributions of three features; the separations (μ₁⁺ − μ₁⁻), (μ₂⁺ − μ₂⁻), (μ₃⁺ − μ₃⁻) between the positive- and negative-class means differ across features]
Weight Score Approach
Weight score:
wⱼ = (μⱼ⁺ − μⱼ⁻) / (σⱼ⁺ + σⱼ⁻)
where μⱼ and σⱼ are the mean and standard deviation of the j-th feature for training examples of the positive or negative class.
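The weight score is a per-feature computation over the two class-conditional samples. A sketch on a made-up 2-feature toy set, where feature 0 separates the classes and feature 1 does not:

```python
# Weight score w_j = (mu_j^+ - mu_j^-) / (sigma_j^+ + sigma_j^-) per feature.
# The 2-feature toy data are illustrative.
from statistics import mean, stdev

def weight_scores(X_pos, X_neg):
    n = len(X_pos[0])
    scores = []
    for j in range(n):
        col_p = [x[j] for x in X_pos]
        col_n = [x[j] for x in X_neg]
        scores.append((mean(col_p) - mean(col_n)) / (stdev(col_p) + stdev(col_n)))
    return scores

# Feature 0 separates the classes well; feature 1 overlaps heavily.
X_pos = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5)]
X_neg = [(1.0, 1.2), (2.0, 1.8), (1.5, 1.4)]
scores = weight_scores(X_pos, X_neg)
```

Ranking features by the magnitude of wⱼ then keeps the most discriminative ones.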
The χ² Test
Test whether the class label and a single attribute are "significantly" correlated with each other.
Two-class classification vs. binary attribute: use a contingency matrix to summarize the data.

             Attribute = 0   Attribute = 1
Class = 0    k₀₀             k₀₁
Class = 1    k₁₀             k₁₁

χ² = n (k₁₁k₀₀ − k₁₀k₀₁)² / [(k₁₁ + k₁₀)(k₀₀ + k₀₁)(k₁₁ + k₀₁)(k₀₀ + k₁₀)]

The χ² measure aggregates the deviation of observed values from expected values (under the independence hypothesis).
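The 2×2 formula above is a one-liner in code; the counts in the example call are hypothetical:

```python
# Chi-square statistic for a 2x2 class-vs-attribute contingency table,
# using the closed-form 2x2 formula from the slide.
def chi_square(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k00 + k01) * (k11 + k01) * (k00 + k10)
    return num / den

# Hypothetical counts: attribute value 1 co-occurs mostly with class 1.
stat = chi_square(k00=30, k01=10, k10=5, k11=25)
```

Note the sanity check: when the table is perfectly balanced (class and attribute independent), the cross-product difference k₁₁k₀₀ − k₁₀k₀₁ vanishes and the statistic is 0.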
Mutual Information (MI)
Let X and Y be discrete random variables. The mutual information between them is defined as
MI(X; Y) = Σₓ Σᵧ Pr(x, y) log [Pr(x, y) / (Pr(x) Pr(y))]
MI(X; Y) = 0 ⟺ X and Y are independent.
The more positive MI is, the more correlated X and Y are.
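The definition translates directly into code, given the joint distribution as a table; the two 2×2 joints below are standard sanity checks, not data from the slides:

```python
# Mutual information of two discrete variables from their joint table:
#   MI(X;Y) = sum_xy p(x,y) * log( p(x,y) / (p(x) p(y)) ), in nats.
import math

def mutual_information(joint):
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                         # 0 * log 0 is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]      # MI = 0
correlated = [[0.5, 0.0], [0.0, 0.5]]           # MI = log 2
```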
Ranking Features for Classification: Filter Model
The perfect feature selection would consider all possible subsets of features:
For each subset, train and test a classifier.
Retain the subset that results in the highest accuracy. (Computationally infeasible!)
Instead, measure the "discrimination ability" of each feature, for example with the weight score, MI, or the χ² measure.
Rank the features w.r.t. this measure and select the top p features.
Caveat: highly linearly correlated features might all be selected.
"Publication has been extended far beyond our present ability to make real use of the record." V. Bush, As we may think, Atlantic Monthly, 176 (1945), pp. 101-108
Can Computers Read? Text Classification & Web Mining
Introduction to Google
The name "Google" is a play on the mathematical term "googol": a very large number, 1 followed by 100 zeros (a number exceeding the number of atoms in the universe, which is only on the order of 10^85) (Note 1).
That is, it represents 10 to the 100th power (10^100).
(Note 1: Dictionary of Mathematics, 1999, Owl Publishing, p. 311)
The History of Google
Google's predecessor was BackRub; the company was founded in September 1998 by Sergey Brin and Larry Page, Ph.D. students at Stanford University.
Retrieval Architecture
(Source: http://www-db.stanford.edu/~backrub/google.html)
[Figure: retrieval architecture with components: lexicon, classification, URL analysis, page ranking]
How PageRank Works
PageRank acts as an indicator of an individual page's value, relying on the uniquely democratic nature of the web through its vast link structure.
In essence, Google interprets a link from page A to page B as a vote cast by page A for page B. Of course, Google looks at where the votes come from, i.e., the number of votes the linking pages themselves receive, and analyzes the pages that cast them. Votes cast by "important" pages help make the pages they link to "important" as well.
Important, high-quality sites receive a higher PageRank, which Google remembers each time it processes a query. Of course, important pages mean nothing to you if they do not match your query.
So Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your query. Google examines the words that appear on a page, and checks all of the page's content (as well as the content of the pages linking to it), to determine whether the result best matches your needs.
Integrity
Google's complex, automated method makes it very hard for anyone intent on tampering with search results to do so. Although relevant advertisements are placed near the search results, Google does not sell placement within the results themselves (in other words, no one can buy a higher PageRank). Google search is a simple, honest, and objective way to find high-quality websites relevant to your search.
(Summarized from Google's own web page: http://www.google.com.tw/intl/zh-TW/why_use.html)
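The voting idea described above can be sketched as a power iteration: each page splits its rank evenly among the pages it links to, with a damping factor for random jumps. The 3-page link graph and the damping value 0.85 are illustrative assumptions, not from the slides:

```python
# Power-iteration sketch of the PageRank idea: each page splits its rank
# evenly among its outlinks; damping models a random jump to any page.
def pagerank(links, n, damping=0.85, iters=100):
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n       # random-jump contribution
        for src, outs in links.items():
            share = rank[src] / len(outs)     # src's vote, split evenly
            for dst in outs:
                new[dst] += damping * share
        rank = new
    return rank

# Toy web: pages 0 and 1 both link to page 2; page 2 links back to page 0.
links = {0: [2], 1: [2], 2: [0]}
rank = pagerank(links, n=3)
```

Page 2, which receives two inbound "votes", ends up with the highest rank; page 1, with no inbound links, gets only the random-jump baseline.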
Preprocessing for Text Classification: Convert Documents into Data Input
Stopwords such as: a, an, the, be, on … Eliminating stopwords reduces space and improves performance.
Polysemy (one word, several meanings): "can" is both a verb and a noun.
"to be or not to be"
Stemming (or conflation) using Porter's algorithm: "university" and "universal" both map to "univers".
Stemming increases the number of documents in the response, but also the number of irrelevant documents.
Reuters-21578
21578 docs, 27000 terms, and 135 classes
21578 documents: 1-14818 belong to the training set; 14819-21578 belong to the testing set.
Reuters-21578 includes 135 categories; using the ApteMod version of the TOPICS set results in 90 categories, with 7,770 training documents and 3,019 testing documents.
Preprocessing Procedures (cont.)
[Figure: term statistics after stopword elimination and after the Porter algorithm]
Binary Text Classification: earn(+) vs. acq(−)
Select the top 500 terms using mutual information.
Evaluate each classifier using the F-measure.
Compare the two classifiers using a 10-fold paired t-test.

10-fold Testing Results: RSVM vs. Naïve Bayes

Fold    1       2       3      4      5      6      7       8      9      10
RSVM    0.965   0.975   0.99   0.984  0.974  0.984  0.936   0.98   0.974  0.974
NB      0.969   0.984   0.969  0.974  0.941  0.964  0.974   0.974  0.953  0.958
d_i    -0.004  -0.009   0.021  0.01   0.033  0.02  -0.038   0.006  0.021  0.016

H₀: there is no difference between RSVM and NB.
t = d̄ / √(σ_d² / k) = 0.0134 / 0.0048 = 2.7917 > t(0.025, 9) = 2.26216
Reject H₀ with 95% confidence level.
Conclusion
Computer Science is in need of Statistics
Thank you!