Chapter 6: Classification and Prediction
Dr. Bernard Chen, Ph.D., University of Central Arkansas
Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures
Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.

For example:
- Is a bank loan applicant "safe" or "risky"?
- Will a customer buy a new computer?
- Analyze cancer data to predict which of three specific treatments should be applied.
Classification

Classification is a two-step process:
- Learning step: a model is constructed from the training set, based on the values (class labels) of a classifying attribute.
- Prediction step: the model is used to predict categorical class labels (discrete or nominal) for new data.
Learning step: Model Construction

Training Data:

| NAME | RANK | YEARS | TENURED |
|------|------|-------|---------|
| Mike | Assistant Prof | 3 | no |
| Mary | Assistant Prof | 7 | yes |
| Bill | Professor | 2 | yes |
| Jim | Associate Prof | 7 | yes |
| Dave | Assistant Prof | 6 | no |
| Anne | Associate Prof | 3 | no |

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Learning step

- Model construction: describing a set of predetermined classes.
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.
Prediction step: Using the Model in Prediction

Testing Data:

| NAME | RANK | YEARS | TENURED |
|------|------|-------|---------|
| Tom | Assistant Prof | 2 | no |
| Merlisa | Associate Prof | 7 | no |
| George | Professor | 5 | yes |
| Joseph | Assistant Prof | 7 | yes |

Unseen Data: (Jeff, Professor, 4) → Tenured?
Prediction step

- Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification result.
- Accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set is independent of the training set; otherwise over-fitting will occur.
N-fold Cross-validation

To address the over-fitting problem, n-fold cross-validation is usually used.

For example, 7-fold cross-validation:
- Divide the whole training dataset into 7 equal parts.
- Take the first part away and train the model on the remaining 6 portions.
- After the model is trained, feed the first part in as the testing dataset and obtain the accuracy.
- Repeat steps two and three, taking the second part away, and so on.
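The fold bookkeeping above can be sketched in Python. This is a minimal sketch under our own naming (`kfold_indices` is not from the slides); real projects usually shuffle the data first or use a library routine:

```python
def kfold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds contiguous folds
    whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(n_folds):
        size = n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 7-fold cross-validation loop: each fold serves as the test set exactly once
n = 14
for test_idx in kfold_indices(n, 7):
    train_idx = [i for i in range(n) if i not in set(test_idx)]
    # train the classifier on train_idx rows, then measure accuracy on test_idx
```

Averaging the 7 per-fold accuracies gives a less optimistic estimate than testing on the training data itself.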
Supervised learning vs. Unsupervised learning

- Because the class label of each training tuple is provided, this step is also known as supervised learning.
- It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is unknown.
Issues: Data Preparation

- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.
Issues: Evaluating Classification Methods

- Accuracy
- Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability
Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures
Decision Tree

- Decision tree induction is the learning of decision trees from class-labeled training tuples.
- A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute.
- Each branch represents an outcome of the test.
- Each leaf node holds a class label.
Decision Tree Example

(An example decision tree is shown as a figure on the slide.)
Decision Tree Algorithm

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
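The steps above can be sketched as a small recursive program. This is a minimal illustration under our own naming (`build_tree`, `classify`, etc. are not from the slides), run here on the tenured table from the earlier slide, treating YEARS as a categorical attribute for simplicity:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information needed to classify a node with these labels."""
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    # Greedy choice: lowest weighted entropy after the split,
    # which is the same as highest information gain.
    def info_after_split(a):
        total = 0.0
        for v in set(r[a] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
            total += len(sub) / len(rows) * entropy(sub)
        return total
    return min(attrs, key=info_after_split)

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # pure node: make a leaf
        return labels[0]
    if not attrs:                             # no tests left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attrs)   # greedy attribute selection
    branches = {}
    for v in set(r[a] for r in rows):         # divide and conquer per value
        keep = [i for i, r in enumerate(rows) if r[a] == v]
        branches[v] = build_tree([rows[i] for i in keep],
                                 [labels[i] for i in keep],
                                 [x for x in attrs if x != a])
    return (a, branches)

def classify(tree, row):
    """Follow tests from the root to a leaf (row must use seen values)."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Tenured example from the earlier slide
rows = [{"RANK": "Assistant Prof", "YEARS": 3},
        {"RANK": "Assistant Prof", "YEARS": 7},
        {"RANK": "Professor",      "YEARS": 2},
        {"RANK": "Associate Prof", "YEARS": 7},
        {"RANK": "Assistant Prof", "YEARS": 6},
        {"RANK": "Associate Prof", "YEARS": 3}]
labels = ["no", "yes", "yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["RANK", "YEARS"])
```

On this toy table every YEARS value is pure, so the greedy step picks YEARS at the root and the tree is just four leaves.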
| age | income | student | credit_rating | buys_computer |
|-----|--------|---------|---------------|---------------|
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31…40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31…40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31…40 | medium | no | excellent | yes |
| 31…40 | high | yes | fair | yes |
| >40 | medium | no | excellent | no |
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
- Expected information (entropy) needed to classify a tuple in D:

Info(D) = -Σ_{i=1}^{m} p_i log2(p_i)
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

- Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
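Both formulas are a few lines of Python. The helpers below are our own naming (not from the slides): `info` evaluates I(c1, ..., cm) from class counts, and `info_split` evaluates Info_A(D) from the per-partition counts:

```python
from math import log2

def info(*counts):
    """Expected information (entropy): I(c1,...,cm) = -sum p_i * log2(p_i),
    where p_i = c_i / total. Zero counts contribute nothing."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_split(*partitions):
    """Info_A(D): weighted entropy after splitting D into partitions,
    each partition given as a tuple of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)
```

For the buys_computer table above, `info(9, 5)` gives about 0.940 and `info_split((2, 3), (4, 0), (3, 2))` gives about 0.694, matching the worked slides that follow.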
Decision Tree

From the buys_computer training data above (9 yes, 5 no):

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Decision Tree

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's.

I(2,3) = -(2/5) log2(2/5) - (3/5) log2(3/5)
Tools

- http://www.ehow.com/how_5144933_calculate-log.html
- http://web2.0calc.com/
Decision Tree

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2)
            = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694

To compute I(2,3), type into http://web2.0calc.com/: -2/5*log2(2/5) - 3/5*log2(3/5)
Decision Tree

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246

Similarly, we can compute:
- Gain(income) = 0.029
- Gain(student) = 0.151
- Gain(credit_rating) = 0.048

Since "age" obtains the highest information gain, we partition the tree on age.
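As a cross-check, all four gains can be computed directly from the buys_computer table. A sketch with our own naming (`gain`, `ATTRS`, etc. are not from the slides):

```python
from collections import Counter
from math import log2

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = Info(D) - Info_A(D) for one attribute of the table."""
    labels = [row[-1] for row in data]
    col = ATTRS[attr]
    info_a = 0.0
    for v in set(row[col] for row in data):
        sub = [row[-1] for row in data if row[col] == v]
        info_a += len(sub) / len(data) * entropy(sub)
    return entropy(labels) - info_a
```

Running `gain` over the four attributes reproduces the slide's values (0.246, 0.029, 0.151, 0.048, to rounding), confirming age as the root split.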
Decision Tree

Info_income(D) = (4/14) I(3,1) + (6/14) I(4,2) + (4/14) I(2,2)
             = (4/14)(0.811) + (6/14)(0.918) + (4/14)(1)
             = 0.232 + 0.393 + 0.286 = 0.911

Gain(income) = 0.940 - 0.911 = 0.029
Decision Tree

(The resulting decision tree is shown as figures on the slides.)
Another Decision Tree Example

| NAME | RANK | YEARS | TENURED |
|------|------|-------|---------|
| Mike | Assistant Prof | 3 | no |
| Mary | Assistant Prof | 7 | yes |
| Bill | Professor | 2 | yes |
| Jim | Associate Prof | 7 | yes |
| Dave | Assistant Prof | 6 | no |
| Anne | Associate Prof | 3 | no |
Decision Tree Example

Info(Tenured) = I(3,3) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1

(Side note: log2 can be computed from log10, e.g. log2(12) = log 12 / log 2 = 1.07918 / 0.30103 = 3.584958.)
Decision Tree Example

Info_RANK(Tenured) = (3/6) I(1,2) + (2/6) I(1,1) + (1/6) I(1,0)
                  = (3/6)(0.918) + (2/6)(1) + (1/6)(0) = 0.79

- (3/6) I(1,2) means "Assistant Prof" has 3 out of 6 samples, with 1 yes and 2 no's.
- (2/6) I(1,1) means "Associate Prof" has 2 out of 6 samples, with 1 yes and 1 no.
- (1/6) I(1,0) means "Professor" has 1 out of 6 samples, with 1 yes and 0 no's.
Decision Tree Example

Info_YEARS(Tenured) = (1/6) I(1,0) + (2/6) I(0,2) + (1/6) I(0,1) + (2/6) I(2,0) = 0

- (1/6) I(1,0) means "years = 2" has 1 out of 6 samples, with 1 yes and 0 no's.
- (2/6) I(0,2) means "years = 3" has 2 out of 6 samples, with 0 yes's and 2 no's.
- (1/6) I(0,1) means "years = 6" has 1 out of 6 samples, with 0 yes's and 1 no.
- (2/6) I(2,0) means "years = 7" has 2 out of 6 samples, with 2 yes's and 0 no's.
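The RANK and YEARS computations above can be verified numerically. A small sketch (the `info` helper name is our own):

```python
from math import log2

def info(*counts):
    """I(c1,...,cm) = -sum p_i * log2(p_i) over nonzero class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Info(Tenured): 3 yes, 3 no
info_d = info(3, 3)

# Info_RANK: Assistant Prof (1 yes, 2 no), Associate Prof (1, 1), Professor (1, 0)
info_rank = 3/6 * info(1, 2) + 2/6 * info(1, 1) + 1/6 * info(1, 0)

# Info_YEARS: every YEARS value is pure, so the weighted entropy is 0
info_years = (1/6 * info(1, 0) + 2/6 * info(0, 2)
              + 1/6 * info(0, 1) + 2/6 * info(2, 0))
```

So Gain(YEARS) = 1 - 0 = 1 beats Gain(RANK) = 1 - 0.79 = 0.21, and YEARS is chosen as the splitting attribute.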
Group Practice Example

| SEX | RANK | YEARS | TENURED |
|-----|------|-------|---------|
| M | Assistant Prof | 3~5 | no |
| F | Assistant Prof | >=6 | yes |
| M | Professor | <3 | yes |
| M | Associate Prof | >=6 | yes |
| M | Assistant Prof | >=6 | no |
| F | Associate Prof | 3~5 | no |
Outline

- Classification Introduction
- Decision Tree
- Classifier Accuracy Measures
Classifier Accuracy Measures

| classes | (Real) buys_computer = yes | (Real) buys_computer = no | total |
|---------|---------------------------|--------------------------|-------|
| (Predicted) buys_computer = yes | 6954 | 412 | 7366 |
| (Predicted) buys_computer = no | 46 | 2588 | 2634 |
| total | 7000 (buys computer) | 3000 (does not buy computer) | 10000 |
Classifier Accuracy Measures

Alternative accuracy measures (e.g., for cancer diagnosis):
- sensitivity = t-pos / pos = 6954/7000
- specificity = t-neg / neg = 2588/3000
- precision = t-pos / (t-pos + f-pos) = 6954/7366
- accuracy = (t-pos + t-neg) / (pos + neg) = (6954 + 2588)/10000 = 95.42%
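With the confusion matrix above, all four measures reduce to a few divisions. A sketch (the variable names are our own):

```python
# Confusion-matrix cells from the slide
t_pos, f_pos = 6954, 412   # predicted yes: correct / incorrect
f_neg, t_neg = 46, 2588    # predicted no:  incorrect / correct
pos, neg = t_pos + f_neg, t_neg + f_pos   # 7000 real buyers, 3000 non-buyers

sensitivity = t_pos / pos                     # recall on the positive class
specificity = t_neg / neg                     # recall on the negative class
precision   = t_pos / (t_pos + f_pos)         # how trustworthy a "yes" is
accuracy    = (t_pos + t_neg) / (pos + neg)   # overall fraction correct
```

For skewed classes like the cancer-diagnosis example, sensitivity and specificity are more informative than overall accuracy, since a classifier that always predicts the majority class can still score a high accuracy.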