# Business Systems Intelligence: 2. Data Preparation

Post on 22-Jan-2016

22 views

Embed Size (px)

DESCRIPTION

Dr. Brian Mac Namee ( www.comp.dit.ie/bmacnamee ). Business Systems Intelligence: 2. Data Preparation. Acknowledgments. These notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber - PowerPoint PPT PresentationTRANSCRIPT

Business Systems Intelligence:2. Data PreparationDr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)

Data Mining: Concepts and Techniques

* of 25* of 58

AcknowledgmentsThese notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts & Techniques by Jiawei Han and Micheline KamberSome slides are also based on trainers kits provided byMore information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/And information on SAS is available at: www.sas.com

Data Mining: Concepts and Techniques

* of 25* of 58

Data PreprocessingToday we will look at data preprocessing and in particular:Descriptive data summarizationWhat kind of data are we talking about?Why preprocess data?Data cleaningData integration and transformationData reductionDiscretization and concept hierarchy generationSummary

Data Mining: Concepts and Techniques

* of 25* of 58

Descriptive Data SummarizationDescriptive data summarization techniques can be used to identify the typical properties of your dataWe will take a look at:Mean, median, mode and midrangeQuartiles, interquartile range and varianceWe will also introduce the notions of a distributive measure, an algebraic measure and a holistic measureFor all of these measures we will assume a set of attribute observations x1, x2, x3,,xN

Data Mining: Concepts and Techniques

* of 25* of 58

Measuring The Central TendencyThe central tendency of a data set can be considered a measure of the middle of the dataThe most simple, and commonly used is the arithmetic meanThe mean is calculated as:

The arithmetic mean can be upset by noise and outliers

Data Mining: Concepts and Techniques

* of 25* of 58

Measuring The Central Tendency (cont)For skewed data the median can be a better measure than the meanGiven a sorted numerical data set of N distinct values:If N is odd the median is the middle valueIf N is even it is the average of the two middle valuesThe mode of a data set is the value that occurs most frequently in the setThe mode may correspond to more than one value

Data Mining: Concepts and Techniques

* of 25* of 58

Measuring The Dispersion Of DataThe degree to which a data set is spread out is known as the dispersion or variance of the dataTypical measure of dispersion invclude:RangeInterquartile rangeFive-number summaryStandard deviationThe range of a set of observations is the difference between the largest and the smallest values

Data Mining: Concepts and Techniques

* of 25* of 58

Percentiles & QuartilesThe kth percentile of a set of data in numerical order is the value xi having the property that k percent of the observations lie at or below xiThe median is the 50th percentileThe most important percentiles are the median and the quartilesThe first quartile, Q1, is the 25th percentileThe third quartile, Q3, is the 75th percentileThe interquartile range (IQR) is the difference between the third and first quartilesIQR = Q3 - Q1

Data Mining: Concepts and Techniques

* of 25* of 58

The Five Number SummaryTo describe a set of observations the five number summary is often usedThe five number summary consists of:The minimumQ1The medianQ3The maximumBox plots are used to display the summary

Data Mining: Concepts and Techniques

* of 25* of 58

Variance & Standard DeviationThe variance of N observations x1, x2, x3,,xN is given as:

The standard deviation, is the square root of the variance

Data Mining: Concepts and Techniques

* of 25* of 58

What Kind Of Data Are We Talking About?Variables/FeaturesClass/ TargetTuples/Records

Wind Speed (mph)Wind DirectionTemp (C)Wave Size (ft)Wave Period (secs)5Offshore103153Onshore111610Offshore851112Offshore1571010Onshore66112Offshore2812

Good Surf?YesNoYesYesNoNo

Data Mining: Concepts and Techniques

* of 25* of 58

Why Data Preprocessing?Data in the real world is dirtyIncomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate datae.g., occupation=Noisy: containing errors or outlierse.g., Salary=-10Inconsistent: containing discrepancies in codes or namese.g., Age=42 Birthday=03/07/1997e.g., Was rating 1,2,3, now rating A, B, Ce.g., discrepancy between duplicate records

Data Mining: Concepts and Techniques

* of 25* of 58

Why Is Data Dirty?Incomplete data comes fromN/A data value when collectedDifferent consideration between the time when the data was collected and when it is analyzedHuman/hardware/software problemsNoisy data comes from the processing of dataCollectionEntryTransmissionInconsistent data comes fromDifferent data sourcesFunctional dependency violation

Data Mining: Concepts and Techniques

* of 25* of 58

Why Is Data Preprocessing Important?No quality data, no quality mining results!Quality decisions must be based on quality dataE.g. duplicate or missing data may cause incorrect or even misleading statistics.Data warehouses need consistent integration of quality dataData extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse Bill Inmon

Data Mining: Concepts and Techniques

* of 25* of 58

Major Tasks In Data PreprocessingData cleaningFill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistenciesData integrationIntegration of multiple databases, data cubes, or filesData transformationNormalization and aggregationData reductionObtains reduced representation in volume but produces the same or similar analytical resultsData discretizationPart of data reduction but with particular importance, especially for numerical data

Data Mining: Concepts and Techniques

* of 25* of 58

Forms Of Data Preprocessing

Data Mining: Concepts and Techniques

* of 25* of 58

Data Cleaning

Data cleaning tasksFill in missing valuesIdentify outliers and smooth out noisy data Correct inconsistent dataResolve redundancy caused by data integrationData cleaning is one of the three biggest problems in data warehousing Ralph KimballData cleaning is the number one problem in data warehousing DCI Survey

Data Mining: Concepts and Techniques

* of 25* of 58

Data Cleaning Example

Data Mining: Concepts and Techniques

* of 25* of 58

Missing DataData is not always availableE.g. many tuples have no recorded value for several attributes, such as customer income in sales dataMissing data may be due to:Equipment malfunctionInconsistent with other recorded data and thus deletedData not entered due to misunderstandingCertain data may not be considered important at the time of entryNot registering history or changes of the data

Data Mining: Concepts and Techniques

* of 25* of 58

How To Handle Missing Data?Ignore the tupleUsually done when class label is missing Fill in the missing value manuallyTedious & infeasible?Fill in the missing value automaticallyUse a global constant, e.g. unknown Use the attribute meanUse the attribute mean for all samples belonging to the same classUse the most probable value

Data Mining: Concepts and Techniques

* of 25* of 58

Noisy DataNoise: Random error or variance in a measured variableIncorrect attribute values may be due to:Faulty data collection instrumentsData entry problemsData transmission problemsTechnology limitationInconsistency in naming convention

Data Mining: Concepts and Techniques

* of 25* of 58

How to Handle Noisy Data?Binning methodFirst sort data and partition into (equi-depth) binsThen one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.ClusteringDetect and remove outliersCombined computer and human inspectionDetect suspicious values and check by human RegressionSmooth by fitting the data into regression functions

Data Mining: Concepts and Techniques

* of 25* of 58

Simple Discretization Methods: BinningEqual-depth (frequency) partitioning:Divides the range into N intervals, each containing approximately same number of samplesGood data scalingManaging categorical attributes can be trickyEqual-width (distance) partitioning:Divides the range into N intervals of equal size: uniform gridIf A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B A)/N.The most straightforward, but outliers may dominate presentationSkewed data is not handled well.

Data Mining: Concepts and Techniques

* of 25* of 58

Cluster Analysis

Data Mining: Concepts and Techniques

* of 25* of 58

Regressionxyy = x + 1X1Y1Y1

Data Mining: Concepts and Techniques

* of 25* of 58

Data IntegrationData integration: Combines data from multiple sources into a coherent storeSchema integration:Integrate metadata from different sourcesEntity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-#Detecting and resolving data value conflictsFor the same real world entity, attribute values from different sources are differentPossible reasons: Different representations, different scales, e.g., metric v imperial

Data Mining: Concepts and Techniques

* of 25* of 58

Handling Redundancy In Data IntegrationRedundant data occur often through integration of multiple databasesThe same attribute may have different names in different databasesOne attribute may be a derived attribute in another table, e.g. annual revenueRedundant data may be able to be detected by correlation analysisCareful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Mining: Concepts and Techniques

* of 25* of 58

Data TransformationSmoothing: remove noise

Recommended