Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation

Upload: blake-york

Post on 01-Jan-2016

TRANSCRIPT

Page 1: Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation

Outline
* Introduction
* Descriptive Data Summarization
* Data Cleaning
  - Missing value
  - Noise data
* Data Integration
  - Redundancy
* Data Transformation

Page 2

Data Cleaning

Importance
* “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
* “Data cleaning is the number one problem in data warehousing” (DCI survey)

Page 3

Data Cleaning

Data cleaning tasks
* Fill in missing values
* Identify outliers and smooth out noisy data

Page 4

Missing Data

Missing data may be due to:
* equipment malfunction
* data that was inconsistent with other recorded data and thus deleted
* data not entered due to misunderstanding
* certain data not being considered important at the time of entry
* failure to register the history or changes of the data

It is important to note that a missing value does not always imply an error (for example, an attribute that allows null values).

Page 5

How to Handle Missing Data?

* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious and often infeasible.

Page 6

How to Handle Missing Data?

Fill it in automatically with:
* a global constant, e.g., “unknown” (a new class?!)
* the attribute mean
* the attribute mean for all samples belonging to the same class: smarter
* the most probable value: inference-based, such as a Bayesian formula or a decision tree
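
The first three automatic strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as `None`, and the function names and the `incomes`/`labels` sample data are hypothetical, not part of the slides. The inference-based strategy is not shown.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(values, constant="unknown"):
    """Replace every missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing value with the mean of the observed values."""
    attribute_mean = mean(v for v in values if v is not None)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace a missing value with the mean of its own class.

    Assumes every class has at least one observed value.
    """
    observed = defaultdict(list)
    for v, label in zip(values, labels):
        if v is not None:
            observed[label].append(v)
    class_means = {label: mean(vs) for label, vs in observed.items()}
    return [class_means[label] if v is None else v
            for v, label in zip(values, labels)]


# Hypothetical sample: income (in thousands) with a class label per tuple.
incomes = [30, None, 50, 90, None, 110]
labels = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(incomes))
print(fill_with_class_mean(incomes, labels))
```

On this sample the overall mean fills both gaps with the same value, while the class mean fills each gap with a value typical of its own class, which is why the slide calls it "smarter".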

Page 8

Noisy Data

Noise: random error or variance in a measured variable

How to Handle Noisy Data?
* Binning
* Regression
* Clustering

Page 9

Binning

Binning methods smooth a sorted data value by consulting its “neighborhood”:
* First, sort all the values.
* Then distribute the sorted values into a number of “buckets”, or “bins”.
* Then smooth the values by
  - means (each bin value is replaced by the bin mean), or
  - medians (each bin value is replaced by the bin median), or
  - boundaries (each bin value is replaced by the closest boundary value).

Page 10

Simple Discretization Methods: Binning

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34

* Smoothing by bin means (rounded to whole dollars):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
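
The price example can be reproduced with a short Python sketch. It makes a few assumptions that the slides do not state: the number of values divides evenly by the number of bins, bin means are rounded to whole dollars, and a value equally close to both boundaries goes to the lower one. The function names are illustrative only.

```python
from statistics import mean


def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size.

    Assumes len(sorted_values) is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace each value with its bin mean, rounded to a whole number."""
    return [[round(mean(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```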

Page 11

Regression

[Figure: data points on x and y axes with the regression line y = x + 1; Y1 is the observed value at X1 and Y1’ is the corresponding value on the line.]
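
Smoothing by regression can be illustrated with an ordinary least-squares line fit: each observed y value is replaced by the value the fitted line predicts for its x. The sketch below is a minimal version under assumptions of its own (a single predictor, a straight-line model, and made-up sample points scattered around y = x + 1); the slides do not specify a fitting method.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y with the value on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4]
noisy_ys = [2.1, 2.9, 4.2, 4.8]

print(fit_line(xs, noisy_ys))
print(smooth_by_regression(xs, noisy_ys))
```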

Page 12

Cluster Analysis

Page 14

Data integration

Data integration: combines data from multiple sources into a coherent store

Page 15

Data integration problems

* Schema integration: integrate metadata from different sources, e.g., matching A.cust-id with B.cust-#
* Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Page 16

Redundant data

Redundant data often occur when multiple databases are integrated:
* Object identification: the same attribute or object may have different names in different databases
* Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue

Page 17

Redundant data

* Redundant attributes can often be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Page 18

Pearson’s product moment coefficient

Correlation coefficient (also called Pearson’s product moment coefficient):

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.

Page 19

Pearson’s product moment coefficient

The correlation coefficient is always between -1 and +1. The closer the correlation is to +/-1, the closer the relationship is to perfectly linear. Here is how I tend to interpret correlations:

* -1.0 to -0.7: strong negative association
* -0.7 to -0.3: weak negative association
* -0.3 to +0.3: little or no association
* +0.3 to +0.7: weak positive association
* +0.7 to +1.0: strong positive association
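
As a rough illustration, the coefficient can be computed directly from the cross-product form of the formula, using sample standard deviations (the n - 1 divisor, matching the (n - 1) in the denominator). The function names are hypothetical, and the handling of the exact boundary values (-0.7, -0.3, +0.3, +0.7) is an assumption of this sketch, since the ranges above share their endpoints.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson's product moment coefficient of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divisor n - 1).
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross_product = sum(x * y for x, y in zip(a, b))
    return (cross_product - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)


def interpret(r):
    """Map a coefficient to the rule-of-thumb labels listed above."""
    if r <= -0.7:
        return "strong negative association"
    if r <= -0.3:
        return "weak negative association"
    if r < 0.3:
        return "little or no association"
    if r < 0.7:
        return "weak positive association"
    return "strong positive association"


# Two perfectly linearly related attributes give a coefficient of about +1.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r, interpret(r))
```

A coefficient near +1 or -1 between two attributes suggests that one of them may be redundant.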

Page 20

Chi-Square

χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

The larger the χ² value, the more likely the variables are related

Page 21

Chi-Square Calculation: An Example

Suppose a group of 1500 people was surveyed.

The gender of each person was noted:
* Male: 300
* Female: 1200

We have two attributes:
* Gender
* Prefer-reading

Page 22

Chi-Square Calculation: An Example

Observed counts, with expected counts in parentheses:

                            Male        Female        Sum (row)
Like science fiction        250 (90)    200 (360)     450
Not like science fiction    50 (210)    1000 (840)    1050
Sum (col.)                  300         1200          1500

E11 = count(male) * count(fiction) / N = 300 * 450 / 1500 = 90
E12 = count(male) * count(not_fiction) / N = 300 * 1050 / 1500 = 210

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$

Page 23

Chi-Square Calculation: An Example

For this 2 by 2 table, the degrees of freedom are (2-1)(2-1) = 1.

For 1 degree of freedom, the chi-square value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828.

Since our value is above this, we can conclude that gender and prefer_reading are (strongly) correlated for the given group of people.
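
The calculation can be checked with a small Python sketch that derives the expected counts from the row and column totals of the observed table. The function name and the row/column layout (rows for the reading preference, columns for male and female) are choices made for this illustration; the critical value 10.828 is the one quoted above.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    observed is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Rows: like / not like science fiction. Columns: male, female.
observed = [[250, 200],
            [50, 1000]]

chi2, dof = chi_square(observed)
critical_value = 10.828  # 1 degree of freedom, 0.001 significance level
print(chi2, dof, chi2 > critical_value)
```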

Page 25

Data Transformation

Data transformation can involve the following:
* Smoothing: remove noise from the data, including binning, regression and clustering
* Aggregation
* Generalization
* Normalization
* Attribute construction

Page 28

Normalization

* Min-max normalization
* Z-score normalization
* Decimal normalization

Page 29

Min-max normalization

Min-max normalization to [new_min_A, new_max_A]:

$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
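
A direct transcription of the min-max formula into Python, as a minimal sketch; the function and parameter names are illustrative, and it assumes max_a is strictly greater than min_a.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max].

    Assumes max_a > min_a.
    """
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# The income example: 73,600 in the range 12,000 to 98,000.
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))
```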

Page 30

Z-score normalization

Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
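
The same example in Python, together with a helper that normalizes a whole attribute. The helper estimates μ and σ from the data using the population standard deviation (`statistics.pstdev`); that choice, the function names, and the sample values are assumptions of this sketch, since the slides do not say which estimator to use.

```python
from statistics import mean, pstdev


def z_score(v, mu, sigma):
    """Normalize a single value given the attribute mean and standard deviation."""
    return (v - mu) / sigma


def z_score_column(values):
    """Normalize every value of an attribute using its own mean and
    population standard deviation (an assumption of this sketch)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


# The income example: mean 54,000, standard deviation 16,000.
print(z_score(73_600, 54_000, 16_000))

# Hypothetical attribute values with mean 5 and population std. dev. 2.
print(z_score_column([2, 4, 4, 4, 5, 5, 7, 9]))
```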

Page 31

Decimal normalization

Normalization by decimal scaling:

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that max(|v’|) < 1.

Ex. Suppose the recorded values of A range from -986 to 917. The maximum absolute value is 986, so j = 3; -986 is normalized to -0.986 and 917 to 0.917.
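
Decimal scaling can be sketched as a search for the smallest exponent j that brings every scaled value strictly below 1 in absolute value. The function names are illustrative, and the sketch assumes j is non-negative, so an attribute whose values are already below 1 in magnitude is left unchanged.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, with j chosen as above."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


# The example range: values of A between -986 and 917.
print(decimal_scaling_exponent([-986, 917]))
print(decimal_scale([-986, 917]))
```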