Introduction to Big Data
Chapter 9 (Week 5): Data Cleansing
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok [email protected]
Contents
Process of data cleansing
Review
Data Cleansing
1. Data aggregation
NA, Inf, NaN
Variable Check
Raw data check
Missing value treatment
Outlier treatment
Feature Engineering
01 Review: Data preprocessing
copyrightⓒ 2018 All rights reserved by Korea University 4
Data Preprocessing (Review): Various data preprocessing methods
Aggregation
Sampling
Dimensionality reduction
Feature selection
Feature extraction
...
Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
Quality Control for Data (Review): Data cleaning
Data cleansing is the process of detecting corrupt, inaccurate, incomplete, incorrect, or irrelevant parts of the data and then replacing, modifying, or deleting that dirty or coarse data.
Distance Measures (Review): Diverse distance metrics
L1-norm (Manhattan Distance)
L2-norm (Euclidean Distance)
Lmax-norm (Chebyshev Distance)
Edit Distance (e.g. Hamming Distance)
Pearson’s Correlation
Spearman’s Correlation
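The norm-based and edit-style distances in this list can be sketched in a few lines. The course uses R; this and the later sketches use plain Python (standard library only) purely for illustration, and the function names and sample points are assumptions, not part of the lecture.

```python
import math

def manhattan(a, b):
    """L1-norm: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """L2-norm: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Lmax-norm: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

p, q = (1, 2, 3), (4, 6, 3)
print(manhattan(p, q))                 # 7
print(euclidean(p, q))                 # 5.0
print(chebyshev(p, q))                 # 4
print(hamming("karolin", "kathrin"))   # 3
```

Hamming distance only counts substitutions, so it is a restricted kind of edit distance that applies to strings of equal length.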
02 Data cleansing: Exploratory Data Analysis (EDA)
Data Aggregation: Data combining and tabulation (data rectangling)
[Data 1]
Name               Job              Age
Gildong Hong       Football player  20
Samsun Kim         Football player  21
Heungmin Son       Football player  22
Cristiano Ronaldo  Robber           30
Lionel Messi       Football player  30
Kylian Mbappé      Football player  20

[Data 2]
Name               Gender  Goal2020
Heungmin Son       Male    40
Cristiano Ronaldo  Male    0
Lionel Messi       Male    42
Kylian Mbappé      Male    28
Let's combine the two datasets.
Data Aggregation: Data combining and tabulation (data rectangling)
What are the prerequisites for combining data?
What can happen when we combine different datasets?
• Wrong values assigned because of typos in the unique key
• Missing value
• Singleton
• Duplicated features
...
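A minimal sketch of combining [Data 1] and [Data 2] on the Name key shows where missing values come from. It assumes Name is the unique key and uses a hypothetical helper, outer_join; plain Python dictionaries stand in for the tables.

```python
# [Data 1]: Name -> (Job, Age)
data1 = {
    "Gildong Hong": ("Football player", 20),
    "Samsun Kim": ("Football player", 21),
    "Heungmin Son": ("Football player", 22),
    "Cristiano Ronaldo": ("Robber", 30),
    "Lionel Messi": ("Football player", 30),
    "Kylian Mbappé": ("Football player", 20),
}
# [Data 2]: Name -> (Gender, Goal2020)
data2 = {
    "Heungmin Son": ("Male", 40),
    "Cristiano Ronaldo": ("Male", 0),
    "Lionel Messi": ("Male", 42),
    "Kylian Mbappé": ("Male", 28),
}

def outer_join(left, right):
    """Combine two keyed tables; cells with no source value become None."""
    rows = []
    for name in sorted(set(left) | set(right)):
        job, age = left.get(name, (None, None))
        gender, goals = right.get(name, (None, None))
        rows.append({"Name": name, "Job": job, "Age": age,
                     "Gender": gender, "Goal2020": goals})
    return rows

combined = outer_join(data1, data2)
# Names that exist only in [Data 1] end up with missing Gender and Goal2020.
missing = [row["Name"] for row in combined if row["Gender"] is None]
print(missing)
```

A typo in the key (for example a misspelled name in one table) would not match anything, so the same person would appear twice, each time with half of the columns missing.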
NA, NULL, Inf & NaN: Important words
NA (Not Available)   # Missing
NULL                 # Undefined
Inf                  # Infinite (e.g. 3/0)
NaN (Not a Number)   # Not a number (e.g. Inf/Inf)
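The four special values behave differently in code, so they are worth telling apart before any analysis. The sketch below is a Python analogue, not the R semantics: None stands in for NA/NULL, and because 3/0 raises ZeroDivisionError in Python (it evaluates to Inf in R), math.inf is used directly. The classify helper is hypothetical.

```python
import math

values = [1.0, None, math.inf, math.inf / math.inf, 4.0]  # inf / inf gives NaN

def classify(v):
    """Label a cell as missing, NaN, Inf, or an ordinary number."""
    if v is None:
        return "missing"
    if math.isnan(v):
        return "NaN"
    if math.isinf(v):
        return "Inf"
    return "ok"

labels = [classify(v) for v in values]
clean = [v for v in values if classify(v) == "ok"]
print(labels)
print(clean)

# NaN is not equal to anything, including itself, so test it with isnan().
print(math.nan == math.nan)  # False
```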
Step of data cleansing: General process
Once you have finished collecting data and have obtained the tabular data from the total union of the data you need, move on to the next step.
Generally, the data is cleansed through the following process:
1. Variable Check
2. Raw data check
3. Missing value treatment
4. Outlier treatment
5. Feature Engineering
Step of data cleansing: Variable Check
Check the type of each variable (categorical or continuous), and the data type of the variable (Date, Character, Numeric, etc.).
The results of an analysis can differ completely depending on how a variable's type is treated when the model is fitted.
You should also check that the conceptual variable type matches the variable type in your program (Programmer’s privilege).
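A small sketch of the variable check: compare the type each column is supposed to have with the type actually stored in the program. The records, the expected mapping, and the helper mismatched_columns are all hypothetical; here Age was read in as text even though it is conceptually numeric.

```python
records = [
    {"Name": "Heungmin Son", "Age": "22", "Goal2020": 40},
    {"Name": "Lionel Messi", "Age": "30", "Goal2020": 42},
]
# Conceptual type of each variable, as decided by the analyst.
expected = {"Name": str, "Age": int, "Goal2020": int}

def mismatched_columns(rows, expected_types):
    """Return the columns whose stored type differs from the expected type."""
    bad = set()
    for row in rows:
        for col, typ in expected_types.items():
            if not isinstance(row[col], typ):
                bad.add(col)
    return sorted(bad)

before = mismatched_columns(records, expected)
for row in records:
    row["Age"] = int(row["Age"])  # fix: convert the text to a number
after = mismatched_columns(records, expected)
print(before, after)
```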
Step of data cleansing: Raw data check
Single Variable (Feature) analysis
• This step checks each variable independently.
• Use a histogram or boxplot to see the distribution of each variable, along with summary statistics such as the mean, mode, and median.
Step of data cleansing: Raw data check
Summary Statistics
• Summary statistics are numbers that summarize properties of the data.
• Most summary statistics can be calculated in a single pass through the data.
• Frequency, Mean, Variance, Max, Min, Mode, SD, and Median.
Step of data cleansing: Raw data check
Frequency, Mean, Variance, Max, Min, Mode, SD, and Median: Frequency
• The count (or percentage) of how many times a specific value appears.
• Example) The frequency of the value Female in the Gender variable is 0.5 (or 50%).
Step of data cleansing: Raw data check
Frequency, Mean, Variance, Max, Min, Mode, SD, and Median: Mode
• The attribute value with the highest frequency in a specific variable.
• Example) ‘Kim’ is the most common surname in Korea, so ‘Kim’ is the mode of the surname variable for South Korea.
Step of data cleansing: Raw data check
Frequency, Mean, Variance, Max, Min, Mode, SD, and Median: Mean and Median
• Two ways to calculate the central position of a continuous variable.
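The summary statistics named on these slides can be computed with Python's standard library, as a rough stand-in for the R functions used in class. The sample values are made up for illustration, apart from the ages, which come from [Data 1].

```python
from collections import Counter
import statistics

gender = ["Female", "Male", "Female", "Male"]
surnames = ["Kim", "Lee", "Kim", "Park", "Kim"]
ages = [20, 21, 22, 30, 30, 20]

# Frequency: count of each value, and the same as a proportion.
counts = Counter(gender)
relative = {value: n / len(gender) for value, n in counts.items()}

# Mode: the value with the highest frequency.
mode_surname = statistics.mode(surnames)

# Mean and median: two ways to locate the centre of a continuous variable.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Spread: sample variance and sample standard deviation.
var_age = statistics.variance(ages)
sd_age = statistics.stdev(ages)

print(relative["Female"], mode_surname, median_age)  # 0.5 Kim 21.5
print(min(ages), max(ages), mean_age, var_age, sd_age)
```

The mean is pulled toward the two 30-year-olds while the median is not, which is why both are worth checking.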
Step of data cleansing: Raw data check
Visualization
• Visualization presents data in visual form, such as graphics or figures.
• The purpose of visualization is for humans to interpret the visualized information and form an internal model of the information.
• When a large amount of data is visualized, one can find (1) general patterns or trends, and (2) outliers or abnormal patterns in the data.
Step of data cleansing: Raw data check
Why use a Histogram?
• To summarize data from a process that has been collected over a period of time, and graphically present its frequency distribution in bar form.
Step of data cleansing: Raw data check
What does a Histogram Do?
• Displays large amounts of data that are difficult to interpret in tabular form.
• Shows the relative frequency of occurrence of the various data values.
• Reveals the centering, variation, and shape of the data.
• Helps check the distribution assumptions of a statistical analysis.
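What a histogram computes can be sketched without a plotting library: cut the range into equal-width bins and count the values in each. The helper histogram and the choice of three bins are assumptions made for this example.

```python
def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # The top edge belongs to the last bin.
            idx = min(int((v - low) / width), n_bins - 1)
            counts[idx] += 1
    return counts

ages = [20, 21, 22, 30, 30, 20]
counts = histogram(ages, low=20, high=32, n_bins=3)  # bins of width 4
print(counts)  # [4, 0, 2]

# A text rendering: one '#' per observation in the bin.
for i, c in enumerate(counts):
    lo = 20 + 4 * i
    print(f"{lo}-{lo + 4}: {'#' * c}")
```

The empty middle bin is the kind of shape information that is hard to see in a table.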
Step of data cleansing: Raw data check
Check combinations between two variables
• This step analyzes the relationship between two variables.
• Choose the appropriate visualization and analytical method according to the types of the two variables.
Variable types   Visualization                  Analytical way
Cont. vs Cont.   Scatter plot, line plot, etc.  Correlation, linear regression, etc.
Cate. vs Cate.   Cumulative bar graph           Chi-square test, independence test, etc.
Cont. vs Cate.   Histogram, boxplot             Z-test, t-test, ANOVA, etc.
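For the continuous-versus-continuous row of the table, the correlation can be computed directly from its definition. This is a sketch of Pearson's correlation with made-up data; the function name pearson is an assumption.

```python
import math

def pearson(x, y):
    """Pearson's correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson(x, y)
print(round(r, 2))  # 0.77
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates no linear relationship, although a non-linear one may still exist.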
Step of data cleansing: Raw data check
Check relationships among three or more variables
• In general, this check is not commonly performed in the QC process.
• However, it is sometimes necessary to look at relationships among three or more variables.
• Let’s look at an example on the whiteboard.
Step of data cleansing: Missing value treatment
If we create a model with missing values, the accuracy of the model is compromised because the relationships between the variables can be distorted.
Depending on whether missing values occur randomly or systematically, the way that missing values are handled varies slightly.
Step of data cleansing: Missing value treatment
Deletion
• Delete all observations with missing values (Delete All, Listwise Deletion).
• Partial deletion: delete an observation only when it has a missing value in a variable used in the downstream model.
• Deleting all is the easy way, but the total number of observations is reduced, which makes the model less valid.
• Partial deletion has the disadvantage of increasing administrative costs, because the variables used, and therefore the observations kept, vary from model to model.
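The two deletion strategies can be contrasted on a toy table. The rows and both helper names are hypothetical; None marks a missing value.

```python
rows = [
    {"height": 173, "weight": 70, "age": 20},
    {"height": None, "weight": 65, "age": 21},
    {"height": 158, "weight": None, "age": 22},
    {"height": 180, "weight": 80, "age": None},
]

def listwise_delete(rows):
    """Keep only observations with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def partial_delete(rows, used_columns):
    """Keep observations that are complete in the columns a model uses."""
    return [r for r in rows if all(r[c] is not None for c in used_columns)]

complete = listwise_delete(rows)
for_model = partial_delete(rows, ["height", "weight"])
print(len(complete), len(for_model))  # 1 2
```

Partial deletion keeps one more observation here, but a model that uses age instead of weight would keep a different subset, which is the bookkeeping cost mentioned above.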
Step of data cleansing: Missing value treatment
Replace with other values (mean, mode, median)
• Example) If the mean male height is 173 and the mean female height is 158, a missing height in a male observation is replaced with 173.
• This approach can distort the model, because the analyst arbitrarily chooses which variables define "similar" observations (gender in the example above).
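The height example can be sketched as group-wise mean replacement. The individual heights are invented so that the group means match the slide (173 for male, 158 for female); impute_group_mean is a hypothetical helper.

```python
from statistics import mean

people = [
    {"gender": "Male", "height": 170},
    {"gender": "Male", "height": 176},
    {"gender": "Male", "height": None},
    {"gender": "Female", "height": 156},
    {"gender": "Female", "height": 160},
    {"gender": "Female", "height": None},
]

def impute_group_mean(rows, group_col, value_col):
    """Replace missing values with the mean of the same group."""
    observed = {}
    for r in rows:
        if r[value_col] is not None:
            observed.setdefault(r[group_col], []).append(r[value_col])
    group_means = {g: mean(v) for g, v in observed.items()}
    return [
        {**r, value_col: group_means[r[group_col]]} if r[value_col] is None else dict(r)
        for r in rows
    ]

imputed = impute_group_mean(people, "gender", "height")
print(imputed[2]["height"], imputed[5]["height"])  # 173 158
```

Grouping by gender is the analyst's choice; grouping by a different variable would give different replacement values.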
Step of data cleansing: Missing value treatment
Insert predicted values
• Predict the missing values and assign them, using statistical methods (regression modeling) or machine learning methods (clustering or supervised learning methods).
• This is better than replacement with a summary statistic, because the analyst's subjectivity is removed.
• However, a limitation remains: the inserted value is still not an actually observed value.
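As one possible instance of inserting predicted values, the sketch below fits a simple least-squares line on the complete observations and uses it to fill the gap. The data and the helper fit_line are assumptions; any other predictive model could take its place.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# (x, y) pairs; the last y is missing.
data = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0), (5, None)]

observed = [(x, y) for x, y in data if y is not None]
slope, intercept = fit_line([x for x, _ in observed], [y for _, y in observed])

# Fill each missing y with the model's prediction for its x.
imputed = [(x, y if y is not None else slope * x + intercept) for x, y in data]
print(imputed[-1])
```

The filled value lies exactly on the fitted line, so it carries none of the noise a real observation would have.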
Step of data cleansing: Outlier treatment
Outliers are observations far from the main cluster of the data that are likely to distort the model.
What counts as an outlier is a very relative concept.
Step of data cleansing: Outlier treatment
An easy and simple way to find outliers is to visualize the distribution of the variables.
In general, we use a boxplot or histogram for one variable and a scatter plot for two variables.
Step of data cleansing: Outlier treatment
The visualization approach is intuitive, but it is also arbitrary (subjective).
That’s why it’s preferable to employ a statistical approach to find and remove outliers (e.g. Cook’s D value).
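Cook's D needs a fitted regression model, so the sketch below swaps in a simpler statistical rule for a single variable: the 1.5 × IQR fences that a boxplot draws. This is a substitute chosen for brevity, not the method named on the slide; the data and the helper iqr_outliers are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper], (lower, upper)

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
outliers, fences = iqr_outliers(data)
print(outliers)  # [102]
print(fences)    # (8.0, 18.0)
```

The multiplier k is still a convention, but once it is fixed the rule gives the same answer for every analyst, unlike judging a plot by eye.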
Step of data cleansing: Outlier treatment
The first way to handle outliers is to delete them.
• If the outlier is caused by human error (e.g. a typo or an unrealistic response), we can delete the observation.
The second way is replacement.
• Replace observations with other values (mean, etc.) instead of deleting them.
Step of data cleansing: Outlier treatment
The third way is to turn the outliers into a variable.
• Example) Add a variable that indicates whether or not the sample is engaged in a profession.
[Figure: Salary (y-axis) versus Years of service (x-axis)]
Step of data cleansing: Outlier treatment
The last way is sub-sampling
• If the samples belonging to a category are outliers, it is advisable to analyze them separately from the rest.
Step of data cleansing: Feature Engineering
A series of processes that add information to data using existing variables.
• Scaling (Normalization)
• Binning (Categorization)
• Transform
• Dummy
Step of data cleansing: Feature Engineering
• Scaling (Normalization)
Used when we want to change the unit of a variable.
Used when the distribution of a variable is skewed.
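Two common scalings are min-max scaling to the range [0, 1] and z-score standardization. The sketch applies both to the Goal2020 column of [Data 2]; the function names are assumptions, and the population standard deviation is used for the z-score.

```python
import statistics

def min_max(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

goals = [40, 0, 42, 28]
scaled = min_max(goals)
standardized = z_score(goals)
print(scaled)
print(standardized)
```

Both transformations change only the unit of the variable; the ordering of the observations stays the same.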
Step of data cleansing: Feature Engineering
• Binning (Categorization)
Converting continuous variables into categorical variables.
Warning) Information loss!
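Binning can be sketched as a lookup against a list of cut points. The edges and labels below are arbitrary choices for illustration, and bin_age is a hypothetical helper.

```python
import bisect

EDGES = (20, 30, 40)                    # cut points between bins
LABELS = ("<20", "20s", "30s", "40+")   # one more label than edges

def bin_age(age):
    """Map a continuous age to the label of the bin it falls in."""
    return LABELS[bisect.bisect_right(EDGES, age)]

ages = [19, 20, 29, 30, 45]
binned = [bin_age(a) for a in ages]
print(binned)  # ['<20', '20s', '20s', '30s', '40+']
```

After binning, 20 and 29 are indistinguishable, which is the information loss the warning refers to.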
Step of data cleansing: Feature Engineering
• Dummy
Convert categorical variables to numerical variables.
Mainly used to change ordinal (categorical) variables such as cancer classification into numerical variables.
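Two ways of turning a categorical variable into numbers are sketched below: 0/1 dummy (indicator) columns, and integer codes that preserve an ordering. The cancer stages, the ORDER mapping, and the helper dummy_encode are made up for illustration.

```python
def dummy_encode(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

stages = ["I", "III", "II", "I"]

# Dummy (indicator) columns: one column per category.
categories, dummies = dummy_encode(stages)
print(categories)  # ['I', 'II', 'III']
print(dummies)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

# Ordinal codes: a single column that keeps the order of the stages.
ORDER = {"I": 1, "II": 2, "III": 3}
codes = [ORDER[s] for s in stages]
print(codes)       # [1, 3, 2, 1]
```

Integer codes assume that the gap between stage I and II equals the gap between II and III; dummy columns make no such assumption but add one column per category.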
One picture is worth a thousand words.
Data cleansing is harder than you think.
Especially if the data is large, you will want to cry.
You will feel this when you do R programming with real data.
One picture is worth a thousand words.
Example of the SAC (Split-Apply-Combine) process
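The pictures for this example are not reproduced here, so the following is only a generic sketch of Split-Apply-Combine, not a reconstruction of the slide. It splits [Data 1] by Job, applies the mean to Age in each group, and combines the results into one table. The helper name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"Name": "Gildong Hong", "Job": "Football player", "Age": 20},
    {"Name": "Samsun Kim", "Job": "Football player", "Age": 21},
    {"Name": "Heungmin Son", "Job": "Football player", "Age": 22},
    {"Name": "Cristiano Ronaldo", "Job": "Robber", "Age": 30},
    {"Name": "Lionel Messi", "Job": "Football player", "Age": 30},
    {"Name": "Kylian Mbappé", "Job": "Football player", "Age": 20},
]

def split_apply_combine(rows, key, value, func):
    # Split: collect the values of each group.
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    # Apply func to each group, then combine into one result table.
    return {group: func(vals) for group, vals in groups.items()}

mean_age_by_job = split_apply_combine(rows, "Job", "Age", mean)
print(mean_age_by_job)
```

Passing a different function (for example max or len) reuses the same split and combine steps.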
Everything is already implemented in R
There are R packages for data cleansing.
Once you understand how the data changes, a single line of R is often enough.
End of Slide