file 498 doc 17 02dm datapreprocessing 2
DESCRIPTION
TRANSCRIPT
![Page 1: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/1.jpg)
1
ผู้��ช่�วยศาสตราจารย�จ�ร�ฎฐา ภู�บุ�ญอบุ ([email protected],
08-9275-9797)
DATA DATA PREPROCEPREPROCESSINGSSING
![Page 2: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/2.jpg)
2
1. WHY DO WE NEED TO PREPROCESS THE DATA?
Fields that are obsolete or redundantMissing valuesOutliersData in a form not suitable for data mining modelsValues not consistent with policy or common sense
![Page 3: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/3.jpg)
3
2. DATA CLEANING Can You Find Any Problems in This Tiny Data Set?
Cust ID Zip
Gender
Income
Age
Marital
Status
Transaction
Amount
1001
10048
M 75000
C M 5000
1002
J2S7K7
F -40000
40 W 4000
1003
90210
1000000
0
45 S 7000
1004
6269
M 50000
0 S 1000
1005
55101
F 99999
30 D 3000
![Page 4: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/4.jpg)
4
3. The three main types of problem data Errors
A recording errorA typing errorA transcription errorAn inversionA repetitionA deliberate error
OutliersMissing observations
![Page 5: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/5.jpg)
5
4. HANDLING MISSING DATA
Some of our field values are missing!
![Page 6: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/6.jpg)
6
4. HANDLING MISSING DATA (cont’d)Insightful Miner offers a choice of
replacement values for missing data:
Replace the missing value with some constant, specified by the analyst.Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).Replace the missing values with a value generated at random from the variable distribution observed.
![Page 7: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/7.jpg)
7
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with user-defined constants
![Page 8: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/8.jpg)
8
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with means or modes
![Page 9: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/9.jpg)
9
4. HANDLING MISSING DATA (cont’d) Replacing missing field values with random draws from the distribution of the variable
![Page 10: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/10.jpg)
10
5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Histogram of vehicle weights: can you find the outlier?
![Page 11: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/11.jpg)
11
5. GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
Scatter plot of mpg against weightlbs shows two outliers
![Page 12: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/12.jpg)
12
6. DATA TRANSFORMATION
Histogram of time-to-60, with summary statistics
![Page 13: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/13.jpg)
13
6. DATA TRANSFORMATION (ต่�อ)
Min–Max Normalization
For the field minimum (8 seconds)
For the variable average (15.548 seconds)
For the variable maximum (25 seconds)
)min()max(
)min(*
XX
XXX
0825
88*
X
444.0825
8548.15*
X
0.1825
825*
X
min–max normalization
values will range from zero to one
![Page 14: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/14.jpg)
14
6. DATA TRANSFORMATION (ต่�อ)
Z-Score Standardization
For the field minimum (8 seconds)
For the variable average (15.548 seconds)
For the variable maximum (25 seconds)
)(
)(*
XSD
XmeanXX
593.2911.2
548.158*
X
Z-score standardization values will usually range between –4
and 40
911.2
548.15548.15*
X
247.3911.2
548.1525*
X
![Page 15: File 498 Doc 17 02dm Datapreprocessing 2](https://reader034.vdocuments.site/reader034/viewer/2022042613/545c09b4b1af9f320a8b461e/html5/thumbnails/15.jpg)
15
6. DATA TRANSFORMATION (ต่�อ)Histogram of time-to-60 after Z-score standardization