UNIT-2 Data Preprocessing

Lecture     Topic
Lecture-13  Why preprocess the data?
Lecture-14  Data cleaning
Lecture-15  Data integration and transformation
Lecture-16  Data reduction
Lecture-17  Discretization and concept hierarchy generation
Lecture-13: Why preprocess the data?

Why Data Preprocessing?
Data in the real world is:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
- Quality decisions must be based on quality data
- A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Broad categories: intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing

- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation, much smaller in volume, that produces the same or similar analytical results
- Data discretization: part of data reduction, of particular importance for numerical data
Forms of data preprocessing [figure]
Lecture-14: Data cleaning
Data Cleaning

Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Missing Data

Data is not always available
- e.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- failure to register history or changes of the data

Missing data may need to be inferred.
How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing
- Fill in the missing value manually
- Use a global constant to fill in the missing value, e.g., "unknown"
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
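As an illustration, here is a minimal sketch of the fill-in strategies above using pandas; the column names "class" and "income" and the sentinel constant are hypothetical.

```python
# Hypothetical toy table with missing "income" values in two classes.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 70.0, None, 90.0],
})

# Global constant fill (a numeric sentinel; "unknown" would work for strings)
df["income_const"] = df["income"].fillna(-1.0)

# Attribute (overall) mean fill
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Class-conditional mean fill
df["income_class_mean"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```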
Noisy Data

Noise: random error or variance in a measured variable.

Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems which require data cleaning:
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?

- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, or bin boundaries
- Clustering: detect and remove outliers
- Regression: smooth by fitting the data to a regression function, e.g., linear regression
Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: a uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- The most straightforward method
- But outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
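A minimal sketch of equal-depth binning with both smoothing rules, reproducing the bins above (the values and the choice of 3 bins come from the example):

```python
# Equal-depth (frequency) binning: each bin gets the same number of values.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means (rounded, as in the example)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of min/max
by_bounds = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```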
Cluster Analysis [figure: data grouped into clusters; points falling outside any cluster are outliers]
Regression [figure: data points fitted by the line y = x + 1; a point (X1, Y1) is smoothed to the fitted value (X1, Y1')]
Lecture-15: Data integration and transformation
Data Integration

Data integration: combines data from multiple sources into a coherent store.

Schema integration:
- integrate metadata from different sources
- entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts:
- for the same real-world entity, attribute values from different sources differ
- possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

Redundant data occur often when integrating multiple databases:
- The same attribute may have different names in different databases
- One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data can often be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
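For instance, a redundancy check via the Pearson correlation coefficient might look like the following sketch; the attribute values and the 0.9 threshold are illustrative assumptions.

```python
# Detect likely-redundant attributes by their correlation coefficient.
import numpy as np

a = np.array([120.0, 98.0, 143.0, 110.0, 155.0])   # e.g., annual revenue
b = np.array([118.0, 100.0, 141.0, 113.0, 152.0])  # e.g., a derived total

r = np.corrcoef(a, b)[0, 1]  # Pearson correlation in [-1, 1]
if abs(r) > 0.9:             # the cutoff is a judgment call
    print(f"r = {r:.3f}: attributes look redundant; consider dropping one")
```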
Data Transformation

- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization

- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
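A minimal NumPy sketch of the three normalizations on an assumed toy array v, with the new min-max range taken to be [0, 1]:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, j smallest so that max(|v'|) < 1
j = 0
while np.abs(v).max() / (10 ** j) >= 1:
    j += 1
decimal = v / (10 ** j)

print(minmax)   # [0.    0.125 0.25  0.5   1.   ]
print(zscore)
print(decimal)  # [0.02 0.03 0.04 0.06 0.1 ]
```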
Lecture-16: Data reduction
Data Reduction

A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data Reduction Strategies

- Data cube aggregation
- Attribute subset selection
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation
Data Cube Aggregation

- The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
- Reference appropriate levels: use the smallest representation sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible
Dimensionality Reduction

Feature selection (attribute subset selection):
- Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features
- Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand

Heuristic methods:
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination
- decision-tree induction
Wavelet Transforms

- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space

Method:
- The length L must be an integer power of 2 (pad with 0s when necessary)
- Each transform step has 2 functions: smoothing and difference
- Applied to pairs of data, producing two sets of data of length L/2
- The two functions are applied recursively until the desired length is reached

[Figures: Haar-2 and Daubechies-4 wavelets]
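A minimal sketch of the Haar transform's recursive smoothing/difference steps, using unnormalized pairwise averages and half-differences (production code would typically use a library such as PyWavelets):

```python
def haar_dwt(x):
    """One full Haar decomposition; len(x) must be 2^k (pad with 0s first)."""
    coeffs = []
    s = list(x)
    while len(s) > 1:
        # Smoothing: pairwise averages. Difference: pairwise half-differences.
        smooth = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        detail = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        coeffs = detail + coeffs   # keep detail coefficients from each level
        s = smooth                 # recurse on the smoothed half
    return s + coeffs              # overall average followed by all details

print(haar_dwt([2, 4, 6, 8]))      # [5.0, -2.0, -1.0, -1.0]
```

Compression then amounts to keeping only the largest-magnitude coefficients and zeroing the rest.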
Principal Component Analysis

- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
[Figure: principal components Y1 and Y2 of data plotted in the original axes X1, X2]
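A minimal PCA sketch with NumPy, reducing N = 100 assumed random vectors from k = 5 dimensions to c = 2 principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 points, k = 5 dimensions

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

c = 2                                    # keep the top-c components
components = eigvecs[:, ::-1][:, :c]     # largest-variance directions first
X_reduced = Xc @ components              # N x c reduced representation
print(X_reduced.shape)                   # (100, 2)
```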
Attribute subset selection

- Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
- The goal is to find a minimum set of attributes
- Uses basic heuristic methods of attribute selection
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: the induced tree tests A4 at the root, then A1 and A6 at the next level, with leaves Class 1 and Class 2]

=> Reduced attribute set: {A1, A4, A6}
Numerosity Reduction

Parametric methods:
- Assume the data fits some model; estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- Log-linear models: obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces

Non-parametric methods:
- Do not assume models
- Major families: histograms, clustering, sampling
Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line, often using the least-squares method
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models

- Linear regression: Y = α + β X
  - The two parameters α and β specify the line and are to be estimated from the data at hand, applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac γ_ad δ_bcd
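A minimal sketch of the least-squares fit for Y = α + β X, with made-up sample points; as a numerosity reduction, only the two fitted parameters would be stored:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of slope (beta) and intercept (alpha)
beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} X")   # the whole data set reduces to 2 numbers
```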
Histograms

- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

[Figure: example histogram with counts 0-40 over value buckets from 10,000 to 90,000]
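A sketch of an equal-width histogram as a reduced representation, storing only bucket edges and counts; the price values are made up to mirror the figure's 10,000-90,000 range:

```python
import numpy as np

prices = np.array([12000, 18000, 31000, 33000, 47000, 52000, 58000, 76000, 88000])
counts, edges = np.histogram(prices, bins=4, range=(10000, 90000))

# The 9 raw values reduce to 4 (range, count) pairs.
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c}")
```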
Clustering

- Partition the data set into clusters; one can then store the cluster representation only
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Clusterings can be hierarchical and stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
Sampling

Sampling allows a large data set to be represented by a much smaller random sample of the data.

Let a large data set D contain N tuples. Methods to reduce data set D:
- Simple random sample without replacement (SRSWOR)
- Simple random sample with replacement (SRSWR)
- Cluster sample
- Stratified sample
[Figure: SRSWOR (simple random sample without replacement) and SRSWR drawn from the raw data]
[Figure: raw data reduced to a cluster/stratified sample]
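A minimal NumPy sketch of SRSWOR, SRSWR, and a proportional stratified sample; the data set, the two strata, and the sample size n = 10 are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
D = np.arange(100)                              # toy data set of N = 100 tuples
n = 10

srswor = rng.choice(D, size=n, replace=False)   # without replacement
srswr  = rng.choice(D, size=n, replace=True)    # with replacement

# Stratified sample: draw from each stratum in proportion to its size
strata = {"low": D[:70], "high": D[70:]}
stratified = np.concatenate([
    rng.choice(s, size=max(1, int(n * len(s) / len(D))), replace=False)
    for s in strata.values()
])
print(srswor, srswr, stratified, sep="\n")
```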
Lecture-17: Discretization and concept hierarchy generation
Discretization

Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization: divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduces data size
- Prepares for further analysis
Discretization and Concept Hierarchy

Discretization:
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies:
- Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Discretization and concept hierarchy generation for numeric data

- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Discretization by intuitive partitioning
Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected for binary discretization.
- The process is recursively applied to the partitions obtained, until some stopping criterion is met, e.g.,

  Ent(S) - E(T, S) > δ

- Experiments show that it may reduce data size and improve classification accuracy.
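A minimal sketch of one binary split: scan the candidate boundaries T and keep the one minimizing E(S, T); the toy values and class labels are made up:

```python
from math import log2

def ent(labels):
    """Class entropy Ent(S) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in {l: labels.count(l) for l in set(labels)}.values())

def best_split(values, labels):
    """Return (E(S, T), T) for the entropy-minimizing boundary T."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        T = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate
        left  = [l for v, l in pairs if v <= T]
        right = [l for v, l in pairs if v > T]
        e = (len(left) * ent(left) + len(right) * ent(right)) / len(pairs)
        if e < best[0]:
            best = (e, T)
    return best

print(best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
# (0.0, 6.5) -- a boundary that separates the two classes perfectly
```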
Discretization by intuitive partitioning

The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
Example of 3-4-5 rule

- Step 1: profit values run from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838
- Step 2: msd = 1,000; rounding gives Low = -$1,000 and High = $2,000
- Step 3: the interval (-$1,000 - $2,000) covers 3 distinct values at the msd, so partition into 3 equal-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
- Step 4: adjust the boundaries to cover Min and Max, giving (-$400 - $5,000), and recursively partition each interval:
  - (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  - (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  - ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  - ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
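A simplified sketch of the core 3-4-5 partitioning step for an already-rounded range; note the full rule refines 7 distinct values into a 2-3-2 grouping, while equal widths are used here for brevity:

```python
def partition_345(low, high):
    """Split [low, high] by the distinct-value count at the msd."""
    msd = 10 ** (len(str(int(abs(high - low)))) - 1)   # most significant unit
    distinct = round((high - low) / msd)
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        n = 3                                          # fallback for the sketch
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(partition_345(-1000, 2000))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)] -- matches Step 3 above
```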
Concept hierarchy generation for categorical data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
Specification of a set of attributes

A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:

- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
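A minimal sketch of this ordering rule, using the distinct-value counts from the example:

```python
# Order attributes into a hierarchy: fewest distinct values = highest level.
counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(counts, key=counts.get)   # top level first
print(" < ".join(hierarchy))
# country < province_or_state < city < street
```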