Introduction to Big Data

Chapter 9 (Week 5): Data Cleansing

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo [email protected]

Contents

Process of data cleansing

Review

Data Cleansing

Data aggregation

NA, Inf, NaN

Variable Check

Raw data check

Missing value treatment

Outlier treatment

Feature Engineering

01 Review: Data preprocessing

Copyright ⓒ 2018 All rights reserved by Korea University

Data Preprocessing (Review): Various data preprocessing methods

Aggregation

Sampling

Dimensionality reduction

Feature selection

Feature extraction

...

Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.


Quality Control for Data (Review): Data cleaning

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate, incomplete, incorrect, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.


Distance Measures (Review): Diverse distance metrics

L1-norm (Manhattan Distance)

L2-norm (Euclidean Distance)

Lmax-norm (Chebyshev Distance)

Edit Distance (e.g. Hamming Distance)

Pearson’s Correlation

Spearman’s Correlation
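A minimal sketch of four of the distances listed above, written in Python with only the standard library. The function names and the sample inputs are assumptions chosen for illustration, not part of the slides.

```python
def manhattan(a, b):
    """L1-norm: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """L2-norm: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    """Lmax-norm: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

p, q = (0, 0), (3, 4)
print(manhattan(p, q), euclidean(p, q), chebyshev(p, q))  # 7 5.0 4
print(hamming("karolin", "kathrin"))                      # 3
```

The same pair of points gives three different distances, which is why the choice of metric should be made before clustering or matching records.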

02 Data cleansing: Exploratory Data Analysis (EDA)


Data Aggregation: Data combining and tabulation (data rectangling)

[Data 1]
Name               Job              Age
Gildong Hong       Football player  20
Samsun Kim         Football player  21
Heungmin Son       Football player  22
Cristiano Ronaldo  Robber           30
Lionel Messi       Football player  30
Kylian Mbappé      Football player  20

[Data 2]
Name               Gender  Goal2020
Heungmin Son       Male    40
Cristiano Ronaldo  Male    0
Lionel Messi       Male    42
Kylian Mbappé      Male    28

Let's combine the two datasets.
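One way to combine [Data 1] and [Data 2] is a full outer join on Name. The sketch below is a plain-Python illustration under that assumption; the function name `full_outer_join` and the use of `None` for cells that have no match are choices made here, not something the slides prescribe.

```python
# name -> (job, age), copied from [Data 1]
data1 = {
    "Gildong Hong": ("Football player", 20),
    "Samsun Kim": ("Football player", 21),
    "Heungmin Son": ("Football player", 22),
    "Cristiano Ronaldo": ("Robber", 30),
    "Lionel Messi": ("Football player", 30),
    "Kylian Mbappé": ("Football player", 20),
}
# name -> (gender, goal2020), copied from [Data 2]
data2 = {
    "Heungmin Son": ("Male", 40),
    "Cristiano Ronaldo": ("Male", 0),
    "Lionel Messi": ("Male", 42),
    "Kylian Mbappé": ("Male", 28),
}

def full_outer_join(left, right):
    """Join two tables on their key; unmatched cells become None (missing)."""
    rows = []
    for name in sorted(set(left) | set(right)):
        job, age = left.get(name, (None, None))
        gender, goals = right.get(name, (None, None))
        rows.append((name, job, age, gender, goals))
    return rows

combined = full_outer_join(data1, data2)
for row in combined:
    print(row)
```

Two players appear only in [Data 1], so their Gender and Goal2020 cells come out missing. A typo in a name would silently produce an extra unmatched row in the same way.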


Data Aggregation: Data combining and tabulation (data rectangling)

What are the prerequisites for combining data?

What can happen when we combine different datasets?

• Wrong values assigned because of typos in the unique key

• Missing value

• Singleton

• Duplicated features

...


NA, NULL, Inf & NaN: Important words

NA (Not Available)    # Missing
NULL                  # Undefined
Inf                   # Infinite (e.g. 3/0)
NaN (Not a Number)    # Not a number (e.g. Inf/Inf)
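The four terms above are written as they appear in R. As a rough illustration only, the sketch below shows the nearest Python analogues, which is an assumption of this example: `None` stands in for a missing or undefined value, and `math.inf` and `math.nan` for Inf and NaN. Unlike R, Python raises `ZeroDivisionError` for `3/0`, so infinity is written explicitly.

```python
import math

missing = None                       # nearest analogue of a missing/undefined value
infinite = math.inf                  # written explicitly; 3/0 raises ZeroDivisionError in Python
not_a_number = math.inf / math.inf   # inf/inf has no defined value, so the result is NaN

print(math.isinf(infinite))          # True
print(math.isnan(not_a_number))      # True
print(not_a_number == not_a_number)  # False: NaN never compares equal, even to itself

# Keep only values that are present and finite before computing statistics.
values = [1.0, None, math.inf, math.nan, 4.0]
usable = [v for v in values if v is not None and math.isfinite(v)]
print(usable)                        # [1.0, 4.0]
```

Because NaN is not equal to itself, it has to be tested with `math.isnan` rather than `==`.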


Steps of data cleansing: General process

Variable Check → Raw data check → Missing value treatment → Outlier treatment → Feature Engineering

Once you have finished collecting data and have obtained the tabular data from the total union of the data you need, move on to the next step.

Generally, the data is cleansed through the following process.


Steps of data cleansing: Variable Check

Check the type of each variable (categorical or continuous), and the data type of the variable (Date, Character, Numeric, etc.).

The results of fitting a model can differ completely depending on the type of each variable.

You should also check that the conceptual variable type matches the variable type in your program (Programmer’s privilege).
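The check that conceptual types match program types can be sketched as a comparison against an expected schema. Everything below (the record, the column names, and the helper `type_mismatches`) is hypothetical and only illustrates the idea.

```python
from datetime import date

# Hypothetical record and intended (conceptual) types, for illustration only.
expected_types = {"name": str, "age": int, "zip_code": str, "joined": date}
record = {"name": "Gildong Hong", "age": "20", "zip_code": 2841, "joined": date(2019, 9, 1)}

def type_mismatches(row, schema):
    """Return {column: (expected type name, actual type name)} for mismatching columns."""
    return {
        col: (want.__name__, type(row[col]).__name__)
        for col, want in schema.items()
        if not isinstance(row[col], want)
    }

print(type_mismatches(record, expected_types))
# {'age': ('int', 'str'), 'zip_code': ('str', 'int')}
```

Here age was read as text, and a zip code, which is conceptually a category, was read as a number. Both would change how a model treats the variable.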


Steps of data cleansing: Raw data check

Single Variable (Feature) analysis

• This step is to check each variable independently.

• Use a histogram or boxplot to see the distribution of each variable along with its summary statistics such as mean, mode, and median.



Summary Statistics

• Summary statistics are numbers that summarize properties of the data.

• Most summary statistics can be calculated in a single pass through the data.

• Frequency, Mean, Variance, Max, Min, Mode, SD, and Median.
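To illustrate the single-pass point, the sketch below accumulates count, mean, variance, SD, min and max while reading each value once. It uses Welford's update for the variance, which is a technique chosen for this example and not one named in the slides. The function name and sample data are assumptions.

```python
def single_pass_summary(values):
    """Count, mean, sample variance, SD, min and max in one pass (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)          # running sum of squared deviations
        lo, hi = min(lo, x), max(hi, x)
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return {"n": n, "mean": mean, "variance": variance,
            "sd": variance ** 0.5, "min": lo, "max": hi}

summary = single_pass_summary([2, 4, 4, 4, 5, 5, 7, 9])
print(summary)
```

The exact median is the usual exception: it needs the values sorted or stored, which is one reason the slide says "most" rather than "all".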



Frequency

• The count (or percentage) of how many times a specific value appears.

• Example) The frequency of the Female value in the Gender variable is 0.5 (or 50%).



Mode

• The attribute value with the highest frequency in a specific variable.

• Example) ‘Kim’ is the most common surname in Korea, so ‘Kim’ is the mode of the surname variable for South Korea.
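Frequency and mode can both be read off a table of counts. The sketch below uses `collections.Counter` from the Python standard library; the two small samples are made up for illustration.

```python
from collections import Counter

# Hypothetical samples, for illustration only.
gender = ["Female", "Male", "Female", "Male", "Male", "Female"]
surname = ["Kim", "Lee", "Kim", "Park", "Kim", "Lee"]

# Relative frequency: count of each value divided by the number of observations.
gender_counts = Counter(gender)
relative = {value: n / len(gender) for value, n in gender_counts.items()}
print(relative["Female"])        # 0.5

# Mode: the value with the highest count.
mode_value, mode_count = Counter(surname).most_common(1)[0]
print(mode_value, mode_count)    # Kim 3
```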



Mean and Median

• Two ways to calculate the central position of a continuous variable.
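Assuming the two measures meant here are the mean and the median, the sketch below shows why it is worth computing both. The ages are hypothetical, and the last one is a deliberate data-entry error.

```python
from statistics import mean, median

# Hypothetical ages; the last value is an obvious data-entry error.
ages = [20, 21, 22, 30, 30, 200]

center_mean = mean(ages)      # pulled upward by the single extreme value
center_median = median(ages)  # middle of the sorted values, barely affected

print(center_mean)
print(center_median)          # 26.0
```

A large gap between the two is a cheap first hint that a variable is skewed or contains outliers.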



Visualization

• It is a way of presenting data in visual form such as graphics or figures.

• The purpose of visualization is for humans to interpret the visualized information and form an internal model of the information.

• When visualizing and expressing a large amount of data, one can find (1) general patterns or trends, and (2) outliers or abnormal patterns in the data.



Why use a Histogram?

• To summarize data from a process that has been collected over a period of time, and graphically present its frequency distribution in bar form.



What does a Histogram Do?

• Displays large amounts of data that are difficult to interpret in tabular form.

• Shows the relative frequency of occurrence of the various data values.

• Reveals the centering, variation, and shape of the data.

• Helps check the distributional assumptions of a statistical analysis.
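What a histogram does can be sketched as counting values in equal-width bins. The helper `histogram`, the number of bins, and the heights below are all assumptions for illustration; a plotting library would normally draw the bars.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                       # all values identical: a single bin
        return [(lo, hi, len(values))]
    counts = [0] * n_bins
    for v in values:
        # The maximum falls on the right edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

# Hypothetical heights in cm, for illustration only.
heights = [150, 152, 158, 160, 161, 163, 165, 170, 171, 173, 175, 190]
bins = histogram(heights, 4)
for left, right, count in bins:
    print(f"{left:6.1f} - {right:6.1f} | {'#' * count}")
```

The text bars already show the centering and spread, and the lone value in the last bin stands out as a candidate outlier.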



Check combinations between two variables

• This step analyzes the relationship between two variables.

• You can choose the appropriate visualization and analytical methods according to the types of two variables.

Variable types    Visualization                  Analytical method
Cont. vs Cont.    Scatter plot, line plot, etc.  Correlation, linear regression, etc.
Cate. vs Cate.    Cumulative bar graph           Chi-square test, independence test, etc.
Cont. vs Cate.    Histogram, boxplot             Z-test, t-test, ANOVA, etc.
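For the continuous vs continuous case, a minimal sketch of Pearson's correlation computed by hand is given below. The function name and the years-of-service and salary figures are hypothetical.

```python
def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical years of service and salary, for illustration only.
years = [1, 2, 3, 4, 5]
salary = [30, 35, 37, 45, 50]
r = pearson(years, salary)
print(round(r, 3))
```

A value close to +1 or -1 suggests a strong linear relationship, which a scatter plot of the same two variables should confirm.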



Check relationships among three or more variables

• In general, this check is rarely performed in the QC process.

• But sometimes it is necessary to look at relationships among three or more variables.

• Let’s look at an example on the whiteboard.


Steps of data cleansing: Missing value treatment

If we create a model with missing values, the accuracy of the model is compromised because the relationships between the variables can be distorted.

Depending on whether missing values occur randomly or systematically, the way that missing values are handled varies slightly.



Deletion

• Delete all observations with missing values (delete all, listwise deletion).

• Partial deletion: delete an observation only when its missing value belongs to a variable used in downstream modeling.

• Deleting all is the easy way, but the total number of observations is reduced, making the model less valid.

• Partial deletion has the disadvantage of increasing administrative costs because the variables vary from model to model.
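The two deletion strategies above can be sketched as follows. The observations, the use of `None` for a missing value, and the function names are assumptions for illustration.

```python
# Hypothetical observations; None marks a missing value.
rows = [
    {"height": 173, "weight": 70, "age": 20},
    {"height": None, "weight": 55, "age": 21},
    {"height": 158, "weight": None, "age": 22},
    {"height": 180, "weight": 82, "age": None},
]

def listwise_delete(data):
    """Keep only observations with no missing value in any variable."""
    return [r for r in data if all(v is not None for v in r.values())]

def partial_delete(data, used_variables):
    """Keep observations that are complete for the variables one model uses."""
    return [r for r in data if all(r[v] is not None for v in used_variables)]

print(len(listwise_delete(rows)))                    # 1
print(len(partial_delete(rows, ["height", "age"])))  # 2
```

Listwise deletion keeps one of four observations here, while a model that only uses height and age can keep two. The price is that each model ends up with its own subset of the data.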



Replace with other values (mean, mode, median)

• Example) If the mean height of males is 173 and the mean height of females is 158, a missing height in a male observation is replaced with 173.

• This approach can distort the model, because the choice of which variables define a "similar" group (here, gender) is made arbitrarily by the analyst.
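The height example above can be sketched as group-mean imputation. The data are made up so that the group means come out as 173 and 158, and the helper `impute_group_mean` is hypothetical. It assumes every group has at least one observed value.

```python
from statistics import mean

# Hypothetical data; None marks a missing height.
people = [
    {"gender": "Male", "height": 171},
    {"gender": "Male", "height": 175},
    {"gender": "Male", "height": None},
    {"gender": "Female", "height": 156},
    {"gender": "Female", "height": 160},
    {"gender": "Female", "height": None},
]

def impute_group_mean(data, group_key, target):
    """Replace missing target values with the mean of the same group."""
    observed = {}
    for r in data:
        if r[target] is not None:
            observed.setdefault(r[group_key], []).append(r[target])
    group_means = {g: mean(vals) for g, vals in observed.items()}
    return [
        {**r, target: group_means[r[group_key]]} if r[target] is None else dict(r)
        for r in data
    ]

filled = impute_group_mean(people, "gender", "height")
print(filled[2]["height"], filled[5]["height"])  # the male and female group means
```

Grouping by gender is itself the analyst's choice. Grouping by age band or by nothing at all would fill in different values.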



Insert predicted values

• This method predicts the missing values and fills them in using statistical methods (regression modeling) or machine learning methods (clustering or supervised learning methods).

• This is better than replacement with a summary statistic, because it removes the subjectivity of the analyst.

• However, a limitation remains, because the filled-in value is still not an actually observed value.
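A minimal sketch of the regression variant: fit a least-squares line on the complete observations, then predict the missing value from it. The height and weight pairs and the helper `fit_line` are hypothetical, and simple linear regression is just one of the models the slide allows.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical (height, weight) pairs; None marks a missing weight.
data = [(160, 55), (170, 65), (180, 75), (175, None)]

# Fit only on observations where the target is present.
complete = [(h, w) for h, w in data if w is not None]
slope, intercept = fit_line([h for h, _ in complete], [w for _, w in complete])

# Fill each missing weight with the value predicted from height.
imputed = [(h, w if w is not None else intercept + slope * h) for h, w in data]
print(imputed[-1])
```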


Steps of data cleansing: Outlier treatment

Outliers are observations away from the main cluster that are likely to distort the model.

It is a very relative concept.



An easy and simple way to find outliers is to visualize the distribution of the variables.

In general, we will use Boxplot or Histogram for one variable and use Scatter plot for two variables.



The visualization approach is intuitive, but it is also arbitrary (subjective).

That’s why it is preferable to use a statistical approach to find and remove outliers (e.g. Cook’s distance).
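As one statistical approach, the sketch below computes Cook's distance for a simple linear regression of y on x. The data are made up, and the cut-off of 4/n is a common rule of thumb assumed here, not a value taken from the slides.

```python
def cooks_distance(xs, ys):
    """Cook's distance of each observation for the simple regression y ~ x."""
    n, p = len(xs), 2                      # p = number of fitted coefficients
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverage)]

# Hypothetical data: the last observation breaks the y = 2x pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 6, 8, 10, 12, 14, 40]

d = cooks_distance(x, y)
threshold = 4 / len(x)   # rule-of-thumb cut-off, an assumption of this sketch
flagged = [i for i, di in enumerate(d) if di > threshold]
print(flagged)           # [7]
```

Unlike reading a plot, the same data and the same cut-off always flag the same observations, which is what makes the result reproducible.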



The first way to handle outliers is to delete them.

• If the outlier is caused by human error (e.g. a typo or an unrealistic response), we can delete the observation.

The second way is replacement.

• Replace observations with other values (mean, etc.) instead of deleting them.



The third way is to turn the outliers into a variable.

• Example) Add a variable that records whether or not the sample is engaged in a profession.

[Figure: Salary plotted against Years of service]



The last way is sub-sampling

• If the samples belonging to a category are outliers, it is advisable to analyze them separately from the rest.


Steps of data cleansing: Feature Engineering

A series of processes that add information to data using existing variables.

• Scaling (Normalization)

• Binning (Categorization)

• Transform

• Dummy



• Scaling (Normalization)

Used when we want to change the unit of a variable, or when the distribution of a variable is skewed.
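Three common scaling choices are sketched below: min-max scaling, standardization, and a log transform for skewed variables. The slide does not name specific methods, so these three, the function names, and the sample values are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale to the range [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def log_transform(values):
    """Compress a right-skewed, strictly positive variable."""
    return [math.log10(v) for v in values]

heights_cm = [150, 160, 170, 180, 190]   # hypothetical heights
incomes = [10, 100, 1000, 10000]         # hypothetical, strongly skewed

print(min_max_scale(heights_cm))         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(heights_cm))
print(log_transform(incomes))
```

Min-max and standardization change only the unit and leave the shape of the distribution alone. The log transform changes the shape, which is why it is the one to reach for when a variable is skewed.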



• Binning (Categorization)

Converting continuous variables into categorical variables.

Warning) Information loss!
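Binning can be sketched with cut points and labels. The ages are the Age column of [Data 1]; the cut points, the labels, and the helper `bin_value` are assumptions for illustration.

```python
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Map a continuous value to a category; edges are the sorted cut points."""
    return labels[bisect_right(edges, value)]

edges = [20, 30, 40]                   # hypothetical cut points
labels = ["<20", "20s", "30s", "40+"]  # one more label than cut points
ages = [20, 21, 22, 30, 30, 20]        # Age column of [Data 1]

binned = [bin_value(a, edges, labels) for a in ages]
print(binned)
# ['20s', '20s', '20s', '30s', '30s', '20s']
```

After binning, the ages 20, 21 and 22 can no longer be told apart, which is the information loss the warning refers to.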



• Dummy

Convert categorical variables to numerical variables.

Mainly used to change ordinal (categorical) variables, such as cancer classification, into numerical variables.
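Two ways of turning a categorical variable into numbers are sketched below. The stage labels and their ordering are hypothetical, and which encoding fits depends on whether the order of the categories should be kept.

```python
# Hypothetical cancer stages, for illustration only.
stages = ["I", "III", "II", "IV", "I"]

# Ordinal encoding: keeps the order of the categories in a single number.
stage_order = {"I": 1, "II": 2, "III": 3, "IV": 4}
ordinal = [stage_order[s] for s in stages]

# Dummy (one-hot) encoding: one 0/1 column per category, no order implied.
categories = sorted(set(stages), key=stage_order.get)
dummies = [[int(s == c) for c in categories] for s in stages]

print(ordinal)      # [1, 3, 2, 4, 1]
print(categories)   # ['I', 'II', 'III', 'IV']
print(dummies[1])   # [0, 0, 1, 0]
```

The ordinal code assumes the gap between stages I and II is the same as between III and IV. Dummy columns avoid that assumption at the cost of more variables.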


One picture is worth a thousand words: Data cleansing is harder than you think.

Especially if the data size is large, you will want to cry.

You can feel this by doing R programming with real data.


One picture is worth a thousand words.

Example of the SAC (Split-Apply-Combine) process
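A minimal sketch of Split-Apply-Combine in plain Python, using the Job and Age columns of [Data 1]. Taking the mean age per job is an assumption chosen for illustration; the slide's own example is a picture that is not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# (job, age) pairs from [Data 1]
rows = [
    ("Football player", 20), ("Football player", 21), ("Football player", 22),
    ("Robber", 30), ("Football player", 30), ("Football player", 20),
]

# Split: group the ages by job.
groups = defaultdict(list)
for job, age in rows:
    groups[job].append(age)

# Apply and combine: compute the mean age per group and collect one table.
result = {job: mean(ages) for job, ages in groups.items()}
print(result)
```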


Everything is already implemented in R: There is an R package for data cleansing.

Once you understand how the data changes, a single line of R code is enough.

End of Slide
