TRANSCRIPT

DATA MINING: CONCEPTS AND TECHNIQUES
UNIT-III, Part-I: Classification and Prediction
DATA MINING CSE@HCST, September 10, 2015
Classification and Prediction: Outline

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- *Rule-based classification
- Classification by backpropagation (neural networks)
- *Support Vector Machines (SVM)
- *Associative classification
- Lazy learners (or learning from your neighbors)
- Other classification methods
- *Prediction
- *Accuracy and error measures
- *Ensemble methods
- *Model selection
- Summary
Objectives (figure slide)
Classification vs. Prediction

Classification:
- Predicts categorical class labels (discrete or nominal).
- Constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses it to classify new data.

Prediction:
- Models continuous-valued functions, i.e., predicts unknown or missing values.

Typical applications: credit approval, document categorization, target marketing, medical diagnosis, treatment-effectiveness analysis, fraud detection.
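The distinction can be made concrete with a small sketch. The credit-approval setting, the thresholds, and the linear model below are made up for illustration; only the categorical-vs-continuous contrast comes from the slide.

```python
def classify_credit(income: float, has_defaulted: bool) -> str:
    """Classification: returns a categorical class label."""
    if has_defaulted:
        return "reject"
    return "approve" if income >= 30000 else "review"

def predict_credit_limit(income: float) -> float:
    """Prediction: models a continuous-valued function (toy linear model)."""
    return 0.2 * income

print(classify_credit(45000, False))    # a discrete label: approve
print(predict_credit_limit(45000))      # a continuous value: 9000.0
```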
Classification types (figure slide)
Classification—A Two-Step Process

1. Model construction: describing a set of predetermined classes.
   - Each tuple/sample is assumed to belong to a predefined class, as determined by the class-label attribute.
   - The set of tuples used for model construction is the training set.
   - The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects.
   - Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification; the accuracy rate is the percentage of test-set samples correctly classified by the model.
   - The test set must be independent of the training set; otherwise over-fitting will occur.
   - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
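The accuracy estimate in step 2 can be sketched as follows. `model` is any function mapping a feature tuple to a class label; the toy model and three-tuple test set are hypothetical.

```python
def accuracy(model, test_set):
    """Fraction of test tuples whose known label matches the model's output."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy model and held-out test set for illustration.
model = lambda x: "yes" if x["years"] > 6 else "no"
test_set = [({"years": 7}, "yes"), ({"years": 2}, "no"), ({"years": 5}, "yes")]
print(accuracy(model, test_set))  # 2 of 3 test samples classified correctly
```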
Example-1: Model Construction (figure slide)
Example-1: Using the Model in Prediction (figure slide)
Example-2, Process (1): Model Construction

Training Data → Classification Algorithm → Classifier (Model)

Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction

Classifier ← Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
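The learned rule from Process (1) can be applied directly to the unseen tuple. Note that the testing data contains a misclassified sample (Merlisa has years = 7 but tenured = no), which is exactly why accuracy is estimated on a test set.

```python
def tenured(rank: str, years: int) -> str:
    """The rule from the slide: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(tenured("Professor", 4))       # Jeff -> yes
print(tenured("Assistant Prof", 2))  # Tom  -> no (matches the known label)
```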
How does classification work? (figure slide)
Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data is classified based on the training set.

Unsupervised learning (clustering):
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Issues: Data Preparation

- Data cleaning: preprocess data to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.
Issues: Evaluating Classification Methods

- Accuracy: classifier accuracy (predicting class labels); predictor accuracy (guessing values of predicted attributes).
- Speed: time to construct the model (training time); time to use the model (classification/prediction time).
- Robustness: handling noise and missing values.
- Scalability: efficiency for disk-resident databases.
- Interpretability: the understanding and insight provided by the model.
- Other measures: e.g., goodness of rules, such as decision-tree size or the compactness of classification rules.
Decision Tree Induction: Training Dataset

This follows an example from Quinlan's ID3 (Playing Tennis).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
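The class distribution of this 14-tuple set (9 yes, 5 no) gives the entropy that decision-tree induction starts from. A minimal sketch:

```python
from math import log2

def info(counts):
    """Expected information (entropy) in bits: -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# buys_computer has 9 'yes' and 5 'no' tuples.
print(round(info([9, 5]), 3))  # 0.940
```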
Decision Tree (figure slides)

Decision Tree Induction (figure slide)

Decision Tree Boundary (figure slide)

Decision Tree Induction (figure slides)
Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
- Expected information (entropy) needed to classify a tuple in D:

    Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

- Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

- Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)
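These three formulas can be combined into a worked computation of Gain(age) for the buys_computer training set: splitting on age gives (yes, no) counts of (2, 3) for <=30, (4, 0) for 31…40, and (3, 2) for >40.

```python
from math import log2

def info(counts):
    """Info(D) = -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

D = [9, 5]                              # class counts (yes, no) in D
partitions = [[2, 3], [4, 0], [3, 2]]   # class counts per age partition

info_D = info(D)
info_age = sum(sum(p) / sum(D) * info(p) for p in partitions)  # Info_age(D)
gain_age = info_D - info_age                                   # Gain(age)
print(round(gain_age, 3))  # ~0.247 (often quoted as 0.246 with rounded intermediates)
```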
Gain Ratio for Attribute Selection (C4.5) (figure slide)
Gini Index (CART, IBM IntelligentMiner) (figure slide)
Comparisons of Attribute Selection Measures (figure slide)
Other Attribute Selection Measures

- CHAID: a popular decision-tree algorithm; its measure is based on the χ² test for independence.
- C-SEP: performs better than information gain and the Gini index in certain cases.
- G-statistic: has a close approximation to the χ² distribution.
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
- Multivariate splits (partitions based on combinations of multiple variables): CART finds multivariate splits based on a linear combination of attributes.
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.
Decision Tree Induction (figure slides; marked IMPORTANT)
EXAMPLE: Decision Tree Induction (figure slides)
EXAMPLE: Calculating Gain Ratio (figure slide)
Gini Index (figure slide)
Calculating Gini Index (figure slide)
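The Gini-index slides survive only as figures, so the formula itself is not in the extracted text. As a sketch of the standard definition used by CART, gini(D) = 1 - Σ p_i², computed here for the buys_computer class distribution (9 yes, 5 no):

```python
def gini(counts):
    """Gini index of a class distribution: 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([9, 5]), 3))  # 0.459
```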
Overfitting and Tree Pruning

- Overfitting: an induced tree may overfit the training data.
  - Too many branches, some of which may reflect anomalies due to noise or outliers.
  - Poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if doing so would cause the goodness measure to fall below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.
Bayesian Classification: Why?

- A statistical classifier: performs probabilistic prediction, i.e., predicts class-membership probabilities.
- Foundation: based on Bayes' theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision-tree and selected neural-network classifiers.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.
Bayes' Theorem: Basics

Total probability theorem:

    P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

Bayes' theorem:

    P(H | X) = P(X | H) P(H) / P(X)

- Let X be a data sample ("evidence"); its class label is unknown.
- Let H be the hypothesis that X belongs to class C.
- Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X.
- P(H) (prior probability): the initial probability. E.g., the probability that X will buy a computer, regardless of age, income, etc.
- P(X): the probability that the sample data is observed.
- P(X|H) (likelihood): the probability of observing sample X given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is 31…40 with medium income.
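A quick numeric check of the theorem, expanding P(X) via the total probability theorem. The prior and likelihoods below are illustrative values (they happen to match the buys_computer worked example later in this unit, but any consistent numbers would do):

```python
p_h = 9 / 14            # P(H): prior, e.g. P(buys_computer = yes)
p_x_given_h = 0.044     # P(X | H): likelihood
p_x_given_not_h = 0.019 # P(X | not H)

# Total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))  # ~0.807: the evidence raises the prior of ~0.643
```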
Prediction Based on Bayes' Theorem

- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H | X) = P(X | H) P(H) / P(X)

- Informally: posterior = likelihood × prior / evidence.
- Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all P(C_k|X) for the k classes.
- Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost.
Classification Is to Derive the Maximum Posteriori

- Let D be a training set of tuples with their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
- Suppose there are m classes C1, C2, …, Cm.
- Classification derives the maximum posteriori, i.e., the maximal P(C_i|X). This follows from Bayes' theorem:

    P(C_i | X) = P(X | C_i) P(C_i) / P(X)

- Since P(X) is constant for all classes, only P(X | C_i) P(C_i) needs to be maximized.
Naïve Bayes Classifier

- A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

    P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × … × P(x_n | C_i)

- This greatly reduces the computation cost: only the class distribution needs to be counted.
- If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k, divided by |C_i,D| (the number of tuples of C_i in D).
- If A_k is continuous-valued, P(x_k|C_i) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) · e^( -(x-μ)² / (2σ²) )

  and P(x_k | C_i) = g(x_k, μ_{C_i}, σ_{C_i}).
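The continuous-attribute case can be sketched directly from the Gaussian density above. The example values (age 35, class mean 38, standard deviation 12) are illustrative, not from the slides:

```python
from math import sqrt, pi, exp

def g(x, mu, sigma):
    """Gaussian density g(x, mu, sigma) = (1 / (sqrt(2*pi)*sigma)) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

# E.g., P(age = 35 | C_i) if ages in class C_i have mean 38 and std. dev. 12:
print(round(g(35, 38, 12), 4))
```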
Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'

Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Naïve Bayes Classifier: An Example

P(C_i):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no")  = 5/14 = 0.357

Compute P(X|C_i) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no")  = 3/5 = 0.600
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no")  = 2/5 = 0.400
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no")  = 1/5 = 0.200
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no")  = 2/5 = 0.400

X = (age <= 30, income = medium, student = yes, credit_rating = fair)

P(X|C_i):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no")  = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

P(X|C_i) × P(C_i):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no")  × P(buys_computer = "no")  = 0.007

Therefore, X belongs to class buys_computer = "yes".
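The whole worked example reduces to a few multiplications; the conditional probabilities below are the counts taken straight from the example above:

```python
p_yes, p_no = 9/14, 5/14  # class priors P(C_i)

# P(X|C_i) as products of conditionals: age, income, student, credit_rating
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)

score_yes = likelihood_yes * p_yes   # P(X|C_i) * P(C_i)
score_no  = likelihood_no * p_no

print(round(likelihood_yes, 3), round(likelihood_no, 3))  # 0.044 0.019
print(round(score_yes, 3), round(score_no, 3))            # 0.028 0.007
print("yes" if score_yes > score_no else "no")            # buys_computer = yes
```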
Avoiding the Zero-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero:

    P(X | C_i) = Π_{k=1}^{n} P(x_k | C_i)

- Example: suppose a dataset with 1000 tuples where income = low (0 tuples), income = medium (990), and income = high (10).
- Use the Laplacian correction (Laplacian estimator): add 1 to each case.
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts.
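The correction is a one-liner: add 1 to each count and add the number of distinct values to the denominator, matching the 1000-tuple example above:

```python
counts = {"low": 0, "medium": 990, "high": 10}
n = sum(counts.values())   # 1000 tuples
k = len(counts)            # 3 distinct income values

# Laplacian correction: (count + 1) / (n + k)
corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(corrected)  # low = 1/1003, medium = 991/1003, high = 11/1003
```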
Naïve Bayes Classifier: Comments

Advantages:
- Easy to implement.
- Good results obtained in most cases.

Disadvantages:
- The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
  E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
- Dependencies among these cannot be modeled by a naïve Bayes classifier.

How to deal with these dependencies? Bayesian belief networks.
Classification by Backpropagation

- Backpropagation: a neural-network learning algorithm.
- Started by psychologists and neurobiologists to develop and test computational analogues of neurons.
- A neural network: a set of connected input/output units where each connection has an associated weight.
- During the learning phase, the network learns by adjusting the weights so as to predict the correct class label of the input tuples.
- Also referred to as connectionist learning, due to the connections between units.
Neural Network as a Classifier
Weakness
Long training time.
Requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure".
Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
Strength
High tolerance to noisy data.
Ability to classify untrained patterns.
Well suited for continuous-valued inputs and outputs.
Successful on a wide array of real-world data.
Algorithms are inherently parallel.
Techniques have been developed for extracting rules from trained neural networks.
![Page 67: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/67.jpg)
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
y = sign( Σ (i = 0 to n) wi·xi − μk )    (here the activation function f is taken to be sign)

[Figure: a single neuron — inputs x0, x1, …, xn, weight vector w = (w0, w1, …, wn), a weighted sum, bias μk, and an activation function f producing the output y.]
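The computation above can be sketched in a few lines of Python (the `perceptron_output` helper and the sample numbers are hypothetical, chosen only to illustrate the formula):

```python
def perceptron_output(x, w, bias):
    """y = sign(sum_i w_i * x_i - bias): a weighted sum passed through a threshold."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1 if s >= 0 else -1

# A hypothetical 3-input unit.
x = [1.0, 0.5, -1.0]
w = [0.4, 0.8, 0.1]
print(perceptron_output(x, w, bias=0.2))  # weighted sum 0.7 - 0.2 = 0.5 -> 1
```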
![Page 68: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/68.jpg)
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward network — the input vector X enters the input layer, weighted connections wij lead to the hidden layer, and the hidden layer feeds the output layer, which emits the output vector.]

Net input to unit j:  Ij = Σi wij·Oi + θj
Output of unit j:  Oj = 1 / (1 + e^(−Ij))
Error of an output unit:  Errj = Oj(1 − Oj)(Tj − Oj)
Error of a hidden unit:  Errj = Oj(1 − Oj) Σk Errk·wjk
Weight update:  wij = wij + (l)·Errj·Oi
Bias update:  θj = θj + (l)·Errj

(l is the learning rate, Oi the output of unit i, and Tj the true target value.)
![Page 69: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/69.jpg)
How Does a Multi-Layer Neural Network Work?
The inputs to the network correspond to the attributes measured for each training tuple.
Inputs are fed simultaneously into the units making up the input layer.
They are then weighted and fed simultaneously to a hidden layer.
The number of hidden layers is arbitrary, although usually only one.
The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction.
The network is feed-forward in that none of the weights cycles back to an
input unit or to an output unit of a previous layer.
From a statistical point of view, networks perform nonlinear regression:
Given enough hidden units and enough training samples, they can closely
approximate any function.
![Page 70: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/70.jpg)
Defining a Network Topology
First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
Normalize the input values of each attribute measured in the training tuples to [0.0, 1.0].
For a discrete-valued attribute, use one input unit per domain value, each initialized to 0.
For classification with more than two classes, one output unit per class is used.
If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.
![Page 71: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/71.jpg)
Backpropagation
Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value.
For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value.
Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”.
Steps-
1. Initialize weights (to small random numbers) and biases in the network.
2. Propagate the inputs forward (by applying the activation function).
3. Backpropagate the error (by updating weights and biases).
4. Check the terminating condition (e.g., when the error is very small).
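The steps above can be sketched for a tiny one-hidden-layer network. This is an illustrative implementation of the update equations, not a production library; `train_step`, the network size, and all numeric values are assumptions:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, target, w_h, b_h, w_o, b_o, lr=0.5):
    """One backpropagation step for a small 1-hidden-layer, 1-output network."""
    # Forward pass: I_j = sum_i w_ij*O_i + theta_j, then O_j = sigmoid(I_j).
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w_h, b_h)]
    o = sigmoid(sum(w * hi for w, hi in zip(w_o, h)) + b_o)
    # Backward pass: Err_j = O_j(1-O_j)(T_j-O_j) at the output unit ...
    err_o = o * (1 - o) * (target - o)
    # ... and Err_j = O_j(1-O_j) * sum_k Err_k * w_jk at each hidden unit.
    err_h = [hi * (1 - hi) * err_o * w for hi, w in zip(h, w_o)]
    # Updates: w_ij += l*Err_j*O_i and theta_j += l*Err_j.
    w_o = [w + lr * err_o * hi for w, hi in zip(w_o, h)]
    b_o = b_o + lr * err_o
    w_h = [[w + lr * e * xi for w, xi in zip(ws, x)]
           for ws, e in zip(w_h, err_h)]
    b_h = [b + lr * e for b, e in zip(b_h, err_h)]
    return w_h, b_h, w_o, b_o, o

# Step 1: initialize weights to small random numbers, biases to zero.
random.seed(0)
w_h = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
b_h = [0.0, 0.0]
w_o = [random.uniform(-0.5, 0.5) for _ in range(2)]
b_o = 0.0
# Steps 2-4: repeatedly present one training tuple; the output drifts toward the target.
out = 0.0
for _ in range(200):
    w_h, b_h, w_o, b_o, out = train_step([1.0, 0.0], 1.0, w_h, b_h, w_o, b_o)
print(round(out, 3))
```

After a few hundred presentations the network's output for this tuple moves close to the target value 1.0, which is exactly the behavior the update rules are designed to produce.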
![Page 72: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/72.jpg)
![Page 73: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/73.jpg)
![Page 74: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/74.jpg)
Multilayer Neural Network
![Page 75: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/75.jpg)
![Page 76: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/76.jpg)
![Page 77: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/77.jpg)
![Page 78: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/78.jpg)
Lazy vs. Eager Learning
Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify.
Lazy: less time in training but more time in predicting.
Accuracy-
A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function.
Eager: must commit to a single hypothesis that covers the entire instance space.
![Page 79: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/79.jpg)
Lazy Learner: Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches-
k-nearest neighbor approach: instances represented as points in a Euclidean space.
Locally weighted regression: constructs a local approximation.
Case-based reasoning: uses symbolic representations and knowledge-based inference.
![Page 80: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/80.jpg)
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function may be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point xq surrounded by + and − training examples; the Voronoi cells around the points form the decision surface induced by 1-NN.]
![Page 81: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/81.jpg)
![Page 82: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/82.jpg)
![Page 83: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/83.jpg)
![Page 84: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/84.jpg)
![Page 85: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/85.jpg)
![Page 86: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/86.jpg)
![Page 87: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/87.jpg)
![Page 88: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/88.jpg)
![Page 89: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/89.jpg)
![Page 90: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/90.jpg)
![Page 91: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/91.jpg)
![Page 92: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/92.jpg)
![Page 93: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/93.jpg)
![Page 94: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/94.jpg)
![Page 95: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/95.jpg)
![Page 96: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/96.jpg)
Example: k-Nearest Neighbor Classifier

| Customer | Age | Income | No. credit cards | Response |
|----------|-----|--------|------------------|----------|
| John     | 35  | 35K    | 3                | No       |
| Rachel   | 22  | 50K    | 2                | Yes      |
| Hannah   | 63  | 200K   | 1                | No       |
| Tom      | 59  | 170K   | 1                | No       |
| Nellie   | 25  | 40K    | 4                | Yes      |
| David    | 37  | 50K    | 2                | ?        |
![Page 97: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/97.jpg)
Example: k-Nearest Neighbor Classifier

| Customer | Age | Income (K) | No. cards | Response | Distance from David |
|----------|-----|------------|-----------|----------|---------------------|
| John     | 35  | 35         | 3         | No       | sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16 |
| Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22−37)² + (50−50)² + (2−2)²] = 15 |
| Hannah   | 63  | 200        | 1         | No       | sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23 |
| Tom      | 59  | 170        | 1         | No       | sqrt[(59−37)² + (170−50)² + (1−2)²] = 122 |
| Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74 |

With k = 3, David's nearest neighbors are Rachel (15), John (15.16), and Nellie (15.74); two of the three voted Yes, so the predicted response for David is Yes.
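The distance calculations and the majority vote can be reproduced in a few lines of Python (the helper names are illustrative):

```python
import math

# Training tuples from the example: (age, income in K, no. of cards) -> response.
train = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Sort all training tuples by distance to David, then take a majority
# vote among the k = 3 nearest.
dists = sorted((euclidean(x, david), label) for x, label in train.values())
k = 3
votes = [label for _, label in dists[:k]]
prediction = max(set(votes), key=votes.count)
print(prediction)  # "Yes": Rachel, John, and Nellie vote 2-1 in favor
```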
![Page 98: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/98.jpg)
Genetic Algorithms (GA-Part-I)
Genetic Algorithm: based on an analogy to biological evolution.
An initial population is created, consisting of randomly generated rules-
Each rule is represented by a string of bits.
E.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100.
If an attribute has k > 2 values, k bits can be used.
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
The fitness of a rule is represented by its classification accuracy on a set of training examples.
Offspring are generated by crossover and mutation.
The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold.
Slow, but easily parallelizable.
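As a sketch, here is one hypothetical way to encode such rules as bit tuples and apply fitness, crossover, and mutation; the encoding scheme (two antecedent bits plus a class bit) and the toy samples are assumptions made for illustration:

```python
# Toy training examples: (A1, A2, class).
samples = [(1, 0, "C2"), (1, 1, "C1"), (0, 0, "C1"), (1, 0, "C2")]

def fitness(rule, samples):
    """Classification accuracy of an encoded rule (hypothetical scheme:
    the first two bits are the required values of A1 and A2, and the
    last bit being 1 means 'predict C2 when the antecedent fires')."""
    correct = 0
    for a1, a2, cls in samples:
        fires = (a1 == rule[0]) and (a2 == rule[1])
        pred = "C2" if (fires and rule[2] == 1) else "C1"
        correct += pred == cls
    return correct / len(samples)

def crossover(p, q, point):
    """One-point crossover: swap tails after the given position."""
    return p[:point] + q[point:], q[:point] + p[point:]

def mutate(rule, i):
    """Flip bit i of the rule."""
    return rule[:i] + (1 - rule[i],) + rule[i + 1:]

r = (1, 0, 1)  # encodes IF A1 AND NOT A2 THEN C2
print(fitness(r, samples))  # 1.0: the rule classifies all four toy samples correctly
```

A GA would repeatedly apply `crossover` and `mutate` to a population of such tuples, keeping the rules with the highest `fitness`.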
![Page 99: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/99.jpg)
Genetic Algorithms (GA)
![Page 100: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/100.jpg)
Genetic Algorithms (GA)
![Page 101: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/101.jpg)
Genetic Algorithms (GA)
![Page 102: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/102.jpg)
Genetic Algorithms (GA)
To use a genetic algorithm, you must encode solutions to your problem in a structure that can be stored in the computer.
This object is a genome (or chromosome). The genetic algorithm creates a population of genomes then
applies crossover and mutation to the individuals in the population to generate new individuals.
It uses various selection criteria so that it picks the best individuals for mating (and subsequent crossover).
Your objective function determines how 'good' each individual is.
![Page 103: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/103.jpg)
Genetic Algorithms (GA)
The genetic algorithm is very simple, yet it performs well on many different types of problems.
But there are many ways to modify the basic algorithm, and many parameters that can be 'tweaked'.
Basically, if you get the objective function right, the representation right and the operators right, then variations on the genetic algorithm and its parameters will result in only minor improvements.
![Page 104: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/104.jpg)
Representation
You can use any representation for the individual genomes in the genetic algorithm.
Holland worked primarily with strings of bits, but you can use arrays, trees, lists, or any other object.
But you must define genetic operators (initialization, mutation, crossover, comparison) for any representation that you decide to use.
Remember that each individual must represent a complete solution to the problem you are trying to optimize.
![Page 105: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/105.jpg)
![Page 106: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/106.jpg)
Mutation operators
These are some sample tree mutation operators. You can use more than one operator during an
evolution. The mutation operator introduces a certain amount
of randomness to the search. It can help the search find solutions that crossover
alone might not encounter.
![Page 107: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/107.jpg)
![Page 108: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/108.jpg)
Crossover operators
These are some sample tree crossover operators. Typically crossover is defined so that two
individuals (the parents) combine to produce two more individuals (the children).
But you can define asexual crossover or single-child crossover as well.
The primary purpose of the crossover operator is to get genetic material from the previous generation to the subsequent generation.
![Page 109: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/109.jpg)
![Page 110: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/110.jpg)
Mutation operators
These are some sample list mutation operators. Notice that lists may be fixed or variable length. Also common are order-based lists in which the sequence is
important and nodes cannot be duplicated during the genetic operations.
You can use more than one operator during an evolution. The mutation operator introduces a certain amount of
randomness to the search. It can help the search find solutions that crossover alone
might not encounter.
![Page 111: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/111.jpg)
![Page 112: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/112.jpg)
![Page 113: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/113.jpg)
![Page 114: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/114.jpg)
Genetic Algorithms (GA)
Two of the most common genetic algorithm implementations are 'simple' and 'steady state'.
The simple GA is a generational algorithm in which the entire population is replaced each generation.
The steady-state genetic algorithm is used by the Genitor program. In this algorithm, only a few individuals are replaced each 'generation'. This type of replacement is often referred to as overlapping populations.
![Page 115: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/115.jpg)
http://lancet.mit.edu/mbwall/presentations
![Page 116: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/116.jpg)
![Page 117: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/117.jpg)
Genetic Algorithms (GA-Part-I)
![Page 118: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/118.jpg)
Outline
Introduction to Genetic Algorithm (GA)
GA Components: Representation, Recombination, Mutation, Parent Selection, Survivor Selection
Example
![Page 119: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/119.jpg)
Introduction to GA (1)

[Figure: taxonomy of search techniques — calculus-based techniques (e.g., Fibonacci search, sort); enumerative techniques (BFS, DFS, dynamic programming); and guided random search techniques (tabu search, hill climbing, simulated annealing, and evolutionary algorithms, which include genetic programming and genetic algorithms).]
![Page 120: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/120.jpg)
Introduction to GA (2)
“Genetic Algorithms are good at taking large, potentially huge search spaces and navigating them, looking for optimal combinations of things, solutions you might not otherwise find in a lifetime.”- Salvatore Mangano, Computer Design, May 1995.
Originally developed by John Holland (1975).
The genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution.
Uses the concepts of "Natural Selection" and "Genetic Inheritance" (Darwin, 1859).
![Page 121: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/121.jpg)
Use of GA
Widely-used in business, science and engineering Optimization and Search Problems Scheduling and Timetabling
![Page 122: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/122.jpg)
Let’s Learn Biology (1)
Our body is made up of trillions of cells. Each cell has a core structure (nucleus) that contains your chromosomes.
Each chromosome is made up of tightly coiled strands of deoxyribonucleic acid (DNA). Genes are segments of DNA that determine specific traits, such as eye or hair color. You have more than 20,000 genes.
A gene mutation is an alteration in your DNA. It can be inherited or acquired during your lifetime, as cells age or are exposed to certain chemicals. Some changes in your genes result in genetic disorders.
![Page 123: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/123.jpg)
Let’s Learn Biology (2)
Source: http://www.riversideonline.com/health_reference/Tools/DS00549.cfm
![Page 124: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/124.jpg)
Let’s Learn Biology (3)
![Page 125: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/125.jpg)
Let’s Learn Biology (4)
Natural Selection Darwin's theory of evolution: only the organisms best
adapted to their environment tend to survive and transmit their genetic characteristics in increasing numbers to succeeding generations while those less adapted tend to be eliminated.
Source: http://www.bbc.co.uk/programmes/p0022nyy
![Page 126: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/126.jpg)
GA is inspired from Nature
A genetic algorithm maintains a population of candidate solutions for the problem at hand, and makes it evolve by iteratively applying a set of stochastic operators.
![Page 127: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/127.jpg)
Nature VS GA
The computer model introduces simplifications (relative to the real biological mechanisms),
BUT
surprisingly complex and interesting structures have emerged out of evolutionary algorithms
![Page 128: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/128.jpg)
High-level Algorithm
produce an initial population of individuals
evaluate the fitness of all individuals
while termination condition not met do
    select fitter individuals for reproduction
    recombine between individuals
    mutate individuals
    evaluate the fitness of the modified individuals
    generate a new population
end while
![Page 129: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/129.jpg)
GA Components

Source: http://www.engineering.lancs.ac.uk
![Page 130: DATA MINING: CONCEPTS AND TECHNIQUES UNIT-III Part-I Classification and Predictions September 10, 2015 DATA MINING CSE@HCST 1](https://reader036.vdocuments.site/reader036/viewer/2022062801/56649e305503460f94b2172e/html5/thumbnails/130.jpg)
GA Components With Example
The MAXONE problem: suppose we want to maximize the number of ones in a string of L binary digits.
It may seem trivial because we know the answer in advance.
However, we can think of it as maximizing the number of correct answers, each encoded by 1, to L difficult yes/no questions.
GA Components: Representation
Encoding: an individual is (naturally) encoded as a string of L binary digits.
Let's say L = 10. Then, 1 = 0000000001 (10 bits).
Initial Population
We start with a population of n random strings. Suppose that L = 10 and n = 6.
We toss a fair coin 60 times and get the following initial population:
s1 = 1111010101
s2 = 0111000101
s3 = 1110110101
s4 = 0100010011
s5 = 1110111101
s6 = 0100110000
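The slides don't include code; as an illustrative sketch (Python chosen here, variable names are my own), the coin-tossing initialization can be written as:

```python
import random

L, n = 10, 6  # bit-string length and population size

# Each individual is a random L-bit string -- equivalent to
# tossing a fair coin L * n = 60 times.
population = ["".join(random.choice("01") for _ in range(L)) for _ in range(n)]
print(population)
```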
Fitness Function: f()
The fitness f of an individual is the number of ones in its bit string. For the initial population above:
s1 = 1111010101 f (s1) = 7
s2 = 0111000101 f (s2) = 5
s3 = 1110110101 f (s3) = 7
s4 = 0100010011 f (s4) = 4
s5 = 1110111101 f (s5) = 8
s6 = 0100110000 f (s6) = 3
---------------------------------------------------
Total fitness = 34
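The MAXONE fitness is just a count of 1-bits; a minimal Python sketch (the function name is my own), checked against the population on this slide:

```python
def fitness(s):
    # MAXONE fitness: number of ones in the bit string
    return s.count("1")

population = ["1111010101", "0111000101", "1110110101",
              "0100010011", "1110111101", "0100110000"]
scores = [fitness(s) for s in population]
print(scores, sum(scores))  # [7, 5, 7, 4, 8, 3] 34
```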
Selection (1)
Next we apply fitness proportionate selection with the roulette wheel method:
We repeat the extraction as many times as the number of individuals we need, so as to keep the same parent population size (6 in our case).
137
Individual i will have a probability to be chosen of
  p(i) = f(i) / Σj f(j)
(Roulette wheel: each individual's slice area is proportional to its fitness value.)
Selection (2)
Suppose that, after performing selection, we get the following population:
s1` = 1111010101 (s1)
s2` = 1110110101 (s3)
s3` = 1110111101 (s5)
s4` = 0111000101 (s2)
s5` = 0100010011 (s4)
s6` = 1110111101 (s5)
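Fitness-proportionate (roulette-wheel) selection can be sketched as below; this is illustrative Python, not from the slides, and it relies on `random.choices` normalizing the raw fitness weights:

```python
import random

def fitness(s):
    return s.count("1")

def roulette_select(population, k):
    # Each individual is drawn with probability f(i) / sum_j f(j);
    # random.choices normalizes the weights for us.
    weights = [fitness(s) for s in population]
    return random.choices(population, weights=weights, k=k)

population = ["1111010101", "0111000101", "1110110101",
              "0100010011", "1110111101", "0100110000"]
parents = roulette_select(population, k=len(population))
print(parents)
```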
Recombination (1)
aka Crossover.
For each couple we decide, according to a crossover probability (for instance 0.6), whether to actually perform crossover or not.
Suppose that we decide to actually perform crossover only for couples (s1`, s2`) and (s5`, s6`).
For each couple, we randomly extract a crossover point, for instance 2 for the first and 5 for the second
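Single-point crossover swaps the parents' tails after the chosen point. A short sketch (illustrative Python, function name my own), applied to the two couples in the example:

```python
def crossover(a, b, point):
    # Swap the tails of the two parents after the crossover point
    return a[:point] + b[point:], b[:point] + a[point:]

# Couple (s1`, s2`) at point 2 and couple (s5`, s6`) at point 5:
print(crossover("1111010101", "1110110101", 2))  # ('1110110101', '1111010101')
print(crossover("0100010011", "1110111101", 5))  # ('0100011101', '1110110011')
```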
Recombination (2)
Crossing over (s1`, s2`) at point 2 and (s5`, s6`) at point 5, and copying s3` and s4` unchanged, gives:
s1`` = 1110110101
s2`` = 1111010101
s3`` = 1110111101
s4`` = 0111000101
s5`` = 0100011101
s6`` = 1110110011
Mutation (1)
Before applying mutation:
s1`` = 1110110101
s2`` = 1111010101
s3`` = 1110111101
s4`` = 0111000101
s5`` = 0100011101
s6`` = 1110110011
After applying mutation:
s1``` = 1110100101
s2``` = 1111110100
s3``` = 1110101111
s4``` = 0111000101
s5``` = 0100011101
s6``` = 1110110001
Mutation (2)
The final step is to apply random mutation: for each bit that we are to copy to the new population we allow a small probability of error (for instance 0.1)
Causes movement in the search space (local or global)
Restores lost information to the population
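Bit-flip mutation can be sketched as below (illustrative Python; the function name and default rate of 0.1 mirror the slide's example):

```python
import random

def mutate(s, p=0.1):
    # Flip each bit independently with probability p (the "error" rate)
    return "".join(("1" if bit == "0" else "0") if random.random() < p else bit
                   for bit in s)
```

With p=0 the string is returned unchanged, and with p=1 every bit is flipped.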
Fitness of New Population
After applying mutation:
s1``` = 1110100101 f (s1```) = 6
s2``` = 1111110100 f (s2```) = 7
s3``` = 1110101111 f (s3```) = 8
s4``` = 0111000101 f (s4```) = 5
s5``` = 0100011101 f (s5```) = 5
s6``` = 1110110001 f (s6```) = 6
-------------------------------------------------------------
Total fitness = 37
Example (End)
In one generation, the total population fitness changed from 34 to 37, thus improved by ~9%
At this point, we go through the same process all over again, until a stopping criterion is met
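The whole generation cycle can be put together in one short Python sketch. This is illustrative only (names and structure are my own); parameter values are taken from the example, and since the process is stochastic the result varies from run to run:

```python
import random

L, N = 10, 6                 # bit-string length, population size
P_CROSS, P_MUT = 0.6, 0.1    # crossover and mutation probabilities
GENERATIONS = 50

def fitness(s):
    return s.count("1")

def select(pop):
    # Fitness-proportionate (roulette-wheel) selection
    if sum(fitness(s) for s in pop) == 0:   # degenerate all-zero population
        return list(pop)
    return random.choices(pop, weights=[fitness(s) for s in pop], k=len(pop))

def crossover(a, b):
    if random.random() < P_CROSS:
        point = random.randint(1, L - 1)
        return a[:point] + b[point:], b[:point] + a[point:]
    return a, b

def mutate(s):
    return "".join(("1" if c == "0" else "0") if random.random() < P_MUT else c
                   for c in s)

# Initial population, then iterate: select -> recombine -> mutate
pop = ["".join(random.choice("01") for _ in range(L)) for _ in range(N)]
for _ in range(GENERATIONS):
    parents = select(pop)
    nxt = []
    for i in range(0, N, 2):
        a, b = crossover(parents[i], parents[i + 1])
        nxt += [mutate(a), mutate(b)]
    pop = nxt

best = max(pop, key=fitness)
print(best, fitness(best))
```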
Distribution of Individuals
[Figure: distribution of individuals in Generation 0 vs. Generation N]
Issues
Choosing basic implementation issues:
  representation
  population size, mutation rate, ...
  selection, deletion policies
  crossover, mutation operators
Termination criteria
Performance, scalability
The solution is only as good as the evaluation function (often the hardest part)
When to Use a GA
Alternate solutions are too slow or overly complicated
Need an exploratory tool to examine new approaches
Problem is similar to one that has already been successfully solved by using a GA
Want to hybridize with an existing solution
Benefits of the GA technology meet key problem requirements
Conclusion
Inspired by nature
Has many areas of application
GA is powerful
END