Business Systems Intelligence: 2. Data Preparation


TRANSCRIPT

• Business Systems Intelligence: 2. Data Preparation
  Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)


Acknowledgments

- These notes are based (heavily) on those provided by the authors to accompany Data Mining: Concepts & Techniques by Jiawei Han and Micheline Kamber
- Some slides are also based on trainer's kits provided by SAS
- More information about the book is available at: www-sal.cs.uiuc.edu/~hanj/bk2/
- Information on SAS is available at: www.sas.com


Data Preprocessing

Today we will look at data preprocessing and in particular:
- Descriptive data summarization
- What kind of data are we talking about?
- Why preprocess data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary


Descriptive Data Summarization

- Descriptive data summarization techniques can be used to identify the typical properties of your data
- We will take a look at:
  - Mean, median, mode and midrange
  - Quartiles, interquartile range and variance
- We will also introduce the notions of a distributive measure, an algebraic measure and a holistic measure
- For all of these measures we will assume a set of attribute observations x1, x2, x3, ..., xN


Measuring The Central Tendency

- The central tendency of a data set can be considered a measure of the middle of the data
- The simplest, and most commonly used, is the arithmetic mean
- The mean of N observations x1, x2, ..., xN is calculated as:

  x̄ = (1/N) · (x1 + x2 + ... + xN)

- The arithmetic mean can be upset by noise and outliers
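As an illustration (not from the original slides), a minimal NumPy sketch of the arithmetic mean and of how a single outlier can drag it upwards; the observation values are invented:

```python
import numpy as np

x = np.array([3.0, 1.0, 5.0, 7.0, 6.0, 8.0])   # invented observations

mean = x.sum() / len(x)        # arithmetic mean: (1/N) * sum of the x_i
print(mean)                    # 5.0

# A single outlier drags the mean upwards
x_noisy = np.append(x, 100.0)
print(x_noisy.mean())          # roughly 18.6
```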


Measuring The Central Tendency (cont)

- For skewed data the median can be a better measure than the mean
- Given a sorted numerical data set of N distinct values:
  - If N is odd, the median is the middle value
  - If N is even, it is the average of the two middle values
- The mode of a data set is the value that occurs most frequently in the set
- The mode may correspond to more than one value
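A similar sketch for the median and the mode; the values, and the use of NumPy and Counter, are assumptions for illustration:

```python
import numpy as np
from collections import Counter

x = [3, 1, 5, 7, 6, 8]                 # invented observations, N even

# Median: middle value of the sorted data (average of the two middle values here)
print(np.median(x))                    # 5.5

# Mode: the most frequent value(s); a data set may have more than one
counts = Counter([1, 2, 2, 3, 3, 4])
top = max(counts.values())
print([v for v, c in counts.items() if c == top])   # [2, 3] -- bimodal
```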


Measuring The Dispersion Of Data

- The degree to which a data set is spread out is known as the dispersion or variance of the data
- Typical measures of dispersion include:
  - Range
  - Interquartile range
  - Five-number summary
  - Standard deviation
- The range of a set of observations is the difference between the largest and the smallest values


Percentiles & Quartiles

- The kth percentile of a set of data in numerical order is the value xi with the property that k percent of the observations lie at or below xi
- The median is the 50th percentile
- The most important percentiles are the median and the quartiles:
  - The first quartile, Q1, is the 25th percentile
  - The third quartile, Q3, is the 75th percentile
- The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1
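A short NumPy sketch of quartiles and the IQR (library and values are assumptions for illustration):

```python
import numpy as np

x = np.array([3, 1, 5, 7, 6, 8, 2, 4, 9, 10])   # invented observations

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                                   # IQR = Q3 - Q1
print(q1, median, q3, iqr)                      # 3.25 5.5 7.75 4.5
```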


The Five-Number Summary

- To describe a set of observations the five-number summary is often used
- The five-number summary consists of:
  - The minimum
  - Q1
  - The median
  - Q3
  - The maximum
- Box plots are used to display the five-number summary
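The five-number summary can be computed the same way; the box-plot call is left as a comment since matplotlib is an assumption here:

```python
import numpy as np

x = np.array([3, 1, 5, 7, 6, 8, 2, 4, 9, 10])

q1, med, q3 = np.percentile(x, [25, 50, 75])
five_num = (x.min(), q1, med, q3, x.max())      # min, Q1, median, Q3, max
print(five_num)

# A box plot is the usual way to display the summary, e.g. with matplotlib:
# import matplotlib.pyplot as plt; plt.boxplot(x); plt.show()
```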


Variance & Standard Deviation

- The variance of N observations x1, x2, x3, ..., xN is given as:

  σ² = (1/N) · Σ (xi - x̄)²   where x̄ is the mean

- The standard deviation, σ, is the square root of the variance
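A sketch of the variance and standard deviation, matching the 1/N definition above (NumPy assumed, values invented):

```python
import numpy as np

x = np.array([3.0, 1.0, 5.0, 7.0, 6.0, 8.0])

var = ((x - x.mean()) ** 2).sum() / len(x)   # (1/N) * sum of (x_i - mean)^2
std = np.sqrt(var)
print(var, std)                              # ~5.67 and ~2.38

# NumPy's built-ins use the same 1/N definition by default
print(x.var(), x.std())
```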


What Kind Of Data Are We Talking About?

- Each column of the table is a variable/feature
- The Good Surf? column is the class/target
- Each row is a tuple/record

Wind Speed (mph) | Wind Direction | Temp (C) | Wave Size (ft) | Wave Period (secs) | Good Surf?
5                | Offshore       | 10       | 3              | 15                 | Yes
3                | Onshore        | 11       | 1              | 6                  | No
10               | Offshore       | 8        | 5              | 11                 | Yes
12               | Offshore       | 15       | 7              | 10                 | Yes
10               | Onshore        | 6        | 6              | 11                 | No
2                | Offshore       | 2        | 8              | 12                 | No
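As a rough sketch, the table could be held as a pandas DataFrame (pandas and the column names are assumptions; the values are transcribed from the reconstructed table above), separating the features from the target:

```python
import pandas as pd

surf = pd.DataFrame({
    "wind_speed_mph":   [5, 3, 10, 12, 10, 2],
    "wind_direction":   ["Offshore", "Onshore", "Offshore",
                         "Offshore", "Onshore", "Offshore"],
    "temp_c":           [10, 11, 8, 15, 6, 2],
    "wave_size_ft":     [3, 1, 5, 7, 6, 8],
    "wave_period_secs": [15, 6, 11, 10, 11, 12],
    "good_surf":        ["Yes", "No", "Yes", "Yes", "No", "No"],
})

X = surf.drop(columns="good_surf")   # variables/features
y = surf["good_surf"]                # class/target
```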


Why Data Preprocessing?

- Data in the real world is dirty
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g. occupation = ""
- Noisy: containing errors or outliers
  - e.g. Salary = -10
- Inconsistent: containing discrepancies in codes or names
  - e.g. Age = 42, Birthday = 03/07/1997
  - e.g. was rating "1, 2, 3", now rating "A, B, C"
  - e.g. discrepancy between duplicate records


Why Is Data Dirty?

- Incomplete data comes from:
  - "N/A" data values when collected
  - Different considerations between the time when the data was collected and when it is analysed
  - Human/hardware/software problems
- Noisy data comes from the processing of data:
  - Collection
  - Entry
  - Transmission
- Inconsistent data comes from:
  - Different data sources
  - Functional dependency violations


Why Is Data Preprocessing Important?

- No quality data, no quality mining results!
- Quality decisions must be based on quality data
  - e.g. duplicate or missing data may cause incorrect or even misleading statistics
- Data warehouses need consistent integration of quality data
- "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse" (Bill Inmon)


Major Tasks In Data Preprocessing

- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, with particular importance for numerical data


    Forms Of Data Preprocessing


Data Cleaning

- Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing" (DCI Survey)


    Data Cleaning Example


Missing Data

- Data is not always available
  - e.g. many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - Equipment malfunction
  - Inconsistency with other recorded data, leading to deletion
  - Data not entered due to misunderstanding
  - Certain data not being considered important at the time of entry
  - Not registering the history or changes of the data


How To Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing
- Fill in the missing value manually: tedious and often infeasible
- Fill in the missing value automatically:
  - Use a global constant, e.g. "unknown"
  - Use the attribute mean
  - Use the attribute mean for all samples belonging to the same class
  - Use the most probable value
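A hedged pandas sketch of the automatic options; the tiny income/segment frame is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income":  [30000, np.nan, 52000, np.nan, 41000],
                   "segment": ["A", "A", "B", "B", "A"]})

dropped   = df.dropna()                                  # ignore the tuple
constant  = df.fillna({"income": -1})                    # global constant
mean_fill = df.fillna({"income": df["income"].mean()})   # attribute mean

# Attribute mean within the same class/segment
class_fill = df.copy()
class_fill["income"] = df.groupby("segment")["income"] \
                         .transform(lambda s: s.fillna(s.mean()))
```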


Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - Faulty data collection instruments
  - Data entry problems
  - Data transmission problems
  - Technology limitations
  - Inconsistency in naming conventions


How To Handle Noisy Data?

- Binning:
  - First sort the data and partition it into (equi-depth) bins
  - Then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions
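A minimal sketch of smoothing by equi-depth bin means using pandas; the price values are illustrative only:

```python
import pandas as pd

# Sorted (illustrative) price values
prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition into 3 equi-depth bins, then replace each value by its bin mean
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())   # 9.0, 22.75 or 29.25 depending on the bin
```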


Simple Discretization Methods: Binning

- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward approach, but outliers may dominate the presentation
  - Skewed data is not handled well
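A sketch contrasting equal-width and equal-depth partitioning with pandas cut/qcut; the skewed value list is invented:

```python
import pandas as pd

values = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])  # skewed, with outliers

# Equal-width: 3 intervals of width W = (B - A) / 3; the outliers stretch the grid
equal_width = pd.cut(values, bins=3)

# Equal-depth: 3 intervals each holding roughly the same number of samples
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```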


    Cluster Analysis


Regression

[Figure: scatter of (x, y) data points with a fitted regression line y = x + 1, showing the fitted value Y1 for an input X1]


Data Integration

- Data integration: combines data from multiple sources into a coherent store
- Schema integration:
  - Integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts:
  - For the same real-world entity, attribute values from different sources may differ
  - Possible reasons: different representations, different scales, e.g. metric vs. imperial


Handling Redundancy In Data Integration

- Redundant data occur often through integration of multiple databases:
  - The same attribute may have different names in different databases
  - One attribute may be a derived attribute in another table, e.g. annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
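A hedged sketch of correlation analysis for spotting redundant attributes; the revenue columns and the pandas/NumPy usage are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
monthly = rng.normal(50_000, 5_000, size=100)

df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue":  monthly * 12,                  # derived, hence redundant
    "head_count":      rng.integers(5, 50, size=100),
})

# Attribute pairs with correlation close to +/-1 are candidates for removal
print(df.corr().round(2))
```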


Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - Min-max normalization
  - Z-score normalization
  - Normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones


Data Transformation: Normalization

- Min-max normalization (to a new range [new_min_A, new_max_A]):
  v' = ((v - min_A) / (max_A - min_A)) · (new_max_A - new_min_A) + new_min_A
- Z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
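A NumPy sketch of the three normalization schemes above; the attribute values are invented:

```python
import numpy as np

v = np.array([-986.0, -355.0, 46.0, 307.0, 917.0])    # illustrative attribute values

# Min-max normalization to the new range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1      # here j = 3
decimal_scaled = v / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")
```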


Data Reduction

- A data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results


Data Reduction Strategies

Data reduction strategies include:
- Data cube aggregation
- Dimensionality reduction: remove unimportant attributes
- Data compression
- Numerosity reduction: fit the data into models
- Discretization and concept hierarchy generation


Data Cube Aggregation

- The lowest level of a data cube is the aggregated data for an individual entity of interest
  - e.g. a customer in a phone calling data warehouse
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
- Reference the appropriate level: use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible
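No data cube engine is shown in these notes; as a stand-in sketch, a pandas group-by can roll hypothetical call records up to higher aggregation levels:

```python
import pandas as pd

# Hypothetical call records: one row per individual call (the lowest level)
calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2", "c3"],
    "year":     [2008, 2009, 2008, 2008, 2009, 2009],
    "minutes":  [12, 30, 5, 7, 22, 18],
})

# Roll up to per-customer, per-year totals
per_customer_year = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()

# Roll up further to per-year totals only
per_year = calls.groupby("year", as_index=False)["minutes"].sum()
```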


    Data Cube Aggregation (cont)


Dimensionality Reduction

- Feature selection (i.e. attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making them easier to understand
- How can we do this?


Dimensionality Reduction (cont)

- There are 2^d possible sub-feature combinations of d features
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction


Heuristic Feature Selection Methods

Several heuristic feature selection methods exist:
- Best single features under the feature independence assumption: choose by significance tests
- Best step-wise feature selection:
  - The best single feature is picked first
  - Then the next best feature conditioned on the first, ...
- Step-wise feature elimination: repeatedly eliminate the worst feature
- Best combined feature selection and elimination
- Optimal branch and bound: use feature elimination and backtracking
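A minimal sketch of step-wise forward selection; the wine data set and the cross-validated decision-tree score are assumptions used only to make the greedy loop concrete:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

def score(feature_idx):
    """Cross-validated accuracy using only the chosen feature columns."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, X[:, feature_idx], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                          # greedily pick the 3 best features
    best = max(remaining, key=lambda f: score(selected + [f]))
    selected.append(best)
    remaining.remove(best)

print(selected)
```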


Example Of Decision Tree Induction

- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- The induced tree splits on A4 at the root, then on A1 and A6, with leaves labelled Class 1 and Class 2
- => Reduced attribute set: {A1, A4, A6}


Data Compression

- String compression:
  - There are extensive theories and well-tuned algorithms
  - Typically lossless compression is used
  - Only limited manipulation is possible without expansion
- Audio/video compression:
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole


Data Compression Types

- Lossless: the original data can be recovered exactly from the compressed data
- Lossy: only an approximation of the original data can be recovered


Data Compression Techniques

Data compression techniques include:
- Wavelet transformations
- Principal components analysis
- Numerosity reduction:
  - Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Non-parametric methods: do not assume models; the major families are histograms, clustering and sampling
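A hedged scikit-learn sketch of principal components analysis on synthetic data, keeping enough components to explain 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 3] = 2 * X[:, 0] + X[:, 1]             # inject some correlated structure

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(2))
```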


Parametric Methods: Regression

- Linear regression: data are modelled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modelled as a linear function of a multidimensional feature vector


Regression Analysis

- Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are estimated using the data at hand
  - They are fitted by applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
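A NumPy sketch of the least-squares estimates of α and β on synthetic points scattered around y = x + 1, plus a multiple-regression fit where a transformed (nonlinear) term is treated as just another input:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 1.0 * x + rng.normal(scale=0.5, size=50)     # points scattered around y = x + 1

# Least-squares estimates of the intercept (alpha) and slope (beta)
beta = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
alpha = y.mean() - beta * x.mean()
print(alpha, beta)                                      # both close to 1.0

# Multiple regression Y = b0 + b1*X1 + b2*X2, here with X2 = x**2
X = np.column_stack([np.ones_like(x), x, x ** 2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```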


Non-Parametric Methods: Histograms

- A popular data reduction technique
- Divide the data into buckets and store the average for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
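A sketch of histogram-style reduction: equal-width buckets with one stored average per bucket (NumPy assumed, values invented):

```python
import numpy as np

values = np.array([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width buckets over the value range; store one average per bucket
counts, edges = np.histogram(values, bins=4)
bucket_ids = np.digitize(values, edges[1:-1])           # bucket index of each value
bucket_means = [values[bucket_ids == b].mean()
                for b in range(4) if (bucket_ids == b).any()]
print(counts, bucket_means)
```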


Non-Parametric Methods: Clustering

- Partition the data set into clusters, and store the cluster representation only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and store the result in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- We'll talk loads more about clustering later on
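A hedged scikit-learn sketch of clustering-based reduction: only the cluster centroids are kept in place of the full (synthetic) data set:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(500, 2)) for c in (0, 5, 10)])

# Keep only the cluster representation (3 centroids) instead of all 1500 points
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(X.shape, "->", kmeans.cluster_centers_.shape)
```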


    Non-Paramet...
