8/10/2019 BITS WASE DM Session 2
TRANSCRIPT

Data Mining: Data
Lecture Notes for Session 2
Introduction to Data Mining
by Tan, Steinbach, Kumar (slides dated 4/18/2004)
Presented by Dr. K. Satyanarayan Reddy
What is Data?
A collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, feature, or dimension.
A collection of attributes describes an object.
An object is also known as a record, point, case, vector, pattern, sample, entity, observation, or instance.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are Attributes; rows are Objects.)
Data
Definition 1: An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.
Definition 2: A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.
The process of Measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.
Attribute Values
Attribute values are numbers or symbols assigned to an attribute.
Distinction between attributes and attribute values:
The same attribute can be mapped to different attribute values.
Example: height can be measured in feet or meters.
Different attributes can be mapped to the same set of values.
Example: attribute values for ID and age are integers.
But the properties of attribute values can be different: ID has no limit, but age has a maximum and minimum value.
Measurement of Length
The way you measure an attribute may not match the attribute's properties.
[Figure: the lengths of five objects A-E measured on two different scales, one that preserves only the order of the lengths and one that preserves their ratios, illustrating that a measurement scale need not capture all properties of an attribute.]
Types of Attributes
There are different types of attributes
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: *, /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all four properties
Types of Attributes
Nominal and ordinal attributes are collectively referred to as Categorical or Qualitative attributes.
As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers.
The remaining two types of attributes, Interval and Ratio, are collectively referred to as Quantitative or Numeric attributes.
Quantitative attributes are represented by numbers and have most of the properties of numbers.
Note: quantitative attributes can be integer-valued or continuous.
Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names; i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests
Interval | For interval attributes, the differences between values are meaningful; i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments
Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Ratio | new_value = a * old_value | Length can be measured in meters or feet.
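The transformation rules in the table can be sketched in code. This is a minimal illustration, not from the slides: Celsius to Fahrenheit is an interval transformation (a * x + b), while meters to feet is a ratio transformation (a * x only), and the example shows why ratios are meaningful only on a ratio scale.

```python
# Sketch (not from the slides): permissible transformations per scale type.
# Interval scales admit new_value = a*old_value + b; ratio scales only a*old_value.

def celsius_to_fahrenheit(c):
    # Interval transformation: a = 9/5, b = 32.
    return 9.0 / 5.0 * c + 32.0

def meters_to_feet(m):
    # Ratio transformation: a = 3.28084, and b must be 0.
    return 3.28084 * m

# Differences are meaningful on an interval scale: a 10-degree-C gap
# maps to the same 18-degree-F gap regardless of where it occurs.
gap_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
gap_high = celsius_to_fahrenheit(90) - celsius_to_fahrenheit(80)

# Ratios are meaningful only on a ratio scale: 20 C is not "twice as hot"
# as 10 C after converting to Fahrenheit, but 2 m really is twice 1 m in feet.
celsius_ratio = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
length_ratio = meters_to_feet(2.0) / meters_to_feet(1.0)
```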
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values.
Examples: zip codes, counts, or the set of words in a collection of documents.
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes.
Continuous Attribute
Has real numbers as attribute values.
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Asymmetric Attributes: binary attributes where only non-zero values are important are called asymmetric binary attributes.
Example: if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.
Important Characteristics of Structured Data
Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess.
Curse of Dimensionality: Data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data.
The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.
Sparsity
Only Presence Counts: For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases fewer than 1% of the entries are non-zero.
In practical terms, sparsity is an advantage, because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage.
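The storage savings from sparsity can be sketched as follows. The records below are made up for illustration; only the non-zero entries are kept, as dictionaries mapping attribute index to value.

```python
# Sketch (hypothetical data): storing a sparse 0/1 record set so that
# only the non-zero entries are kept.

dense_records = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
]

# Keep only (attribute_index -> value) pairs for the non-zero values.
sparse_records = [
    {j: v for j, v in enumerate(row) if v != 0} for row in dense_records
]

dense_cells = sum(len(row) for row in dense_records)    # 30 stored cells
sparse_cells = sum(len(rec) for rec in sparse_records)  # 6 stored cells
```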
Resolution
Patterns depend on the scale.
It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions.
Example: the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.
Record Data
Data that consists of a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Transaction or Market Basket Data
Transaction data is a special type of record data, where each record (transaction) involves a set of items.
Consider a grocery store: the set of products purchased by a customer during one shopping trip constitutes a Transaction, while the individual products that were purchased are the Items.
This type of data is called Market Basket Data because the items in each record are the products in a person's "market basket."
Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes.
Often, the attributes are binary, indicating whether or not an item was purchased, but the attributes can be discrete or continuous, such as the number of items purchased or the amount spent on those items.
The Sparse Data Matrix
A Sparse Data Matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important.
Transaction data is an example of a sparse data matrix that has only 0-1 entries.
A common example is Document Data.
Document Data
Each document becomes a 'term' vector:
each term is a component (attribute) of the vector;
the value of each component is the number of times the corresponding term occurs in the document.
            season  timeout  lost  win  game  score  ball  play  coach  team
Document 1:    3       0      5    0     2     6     0     2      0     2
Document 2:    0       7      0    2     1     0     0     3      0     0
Document 3:    0       1      0    0     1     2     2     0      3     0
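Building such term vectors can be sketched in a few lines. The documents below are made up for illustration; each one is reduced to a count vector over a shared vocabulary, exactly as in the matrix above.

```python
from collections import Counter

# Sketch (hypothetical documents): each document becomes a vector of
# term counts over a fixed vocabulary.
docs = [
    "team lost game team game score game",
    "coach timeout coach play",
]

# The vocabulary is the set of all terms seen across the collection.
vocabulary = sorted(set(" ".join(docs).split()))

def term_vector(doc, vocab):
    # Count term occurrences; absent terms get 0.
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [term_vector(d, vocabulary) for d in docs]
```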
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Graph Data contd.
Examples: generic graph and HTML links
[Figure: a generic graph with weighted edges, and a set of linked Web pages on Data Mining, Graph Partitioning, Parallel Solution of Sparse Linear Systems of Equations, and N-Body Computation and Dense Linear System Solvers.]
Chemical Data
Benzene Molecule: C6H6
Ordered Data
Sequences of Transactions: For some types of data, the attributes have relationships that involve order in time or space.
[Figure: a sequence of transactions over time; each element of the sequence is a set of items/events.]
Sequence Data
Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Example: Genomic sequence data: the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
Many of the problems associated with genetic sequence data involve predicting similarities in the structure and function of genes from similarities in nucleotide sequences.
Ordered Data
Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types of attributes.
e.g.: weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations.
An important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically close tend to be similar in other ways as well. Thus, two points on the Earth that are close to each other usually have similar values for temperature and rainfall.
Spatio-Temporal Data: [Figure: average monthly temperature of land and ocean.]
Data Quality
Data mining focuses on:
the detection and correction of data quality problems, and
the use of algorithms that can tolerate poor data quality.
The first step, detection and correction, is often called Data Cleaning.
What kinds of data quality problems are there?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
noise and outliers
missing values
duplicate data
Issues with Data
Measurement and Data Collection Issues: Perfect data can't be expected. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
Values or even entire data objects may be missing, or there may be spurious or duplicate objects; i.e., multiple data objects that all correspond to a single "real" object.
e.g.: there might be two different records for a person who has recently lived at two different addresses.
Even if all the data is present and "looks fine," there may be inconsistencies: a person has a height of 2 meters, but weighs only 2 kilograms.
Measurement and Data Collection Errors: The term measurement error refers to any problem resulting from the measurement process; typically, the value recorded differs from the true value.
For continuous attributes, the numerical difference between the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object.
e.g.: a study of animals of a certain species might include animals of a related species that are similar in appearance to the species of interest.
Both measurement errors and data collection errors can be either systematic or random.
Noise
Noise refers to modification of original values.
Examples: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen.
[Figure: two sine waves, and the same two sine waves with noise added.]
Precision, Bias and Accuracy
Precision: the closeness of repeated measurements (of the same quantity) to one another.
Bias: a systematic variation of measurements from the quantity being measured.
Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured.
Accuracy: the closeness of measurements to the true value of the quantity being measured.
Accuracy depends on precision and bias, but there is no specific formula for accuracy in terms of these two quantities.
Accuracy is often expressed in terms of the number of significant digits.
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight).
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
Handling missing values:
Eliminate data objects.
Estimate missing values.
Ignore the missing value during analysis.
Replace with all possible values (weighted by their probabilities).
Inconsistent Values: Data can contain inconsistent values. Consider an address field where both a zip code and city are listed, but the specified zip code area is not contained in that city.
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
This is a major issue when merging data from heterogeneous sources.
Example: the same person with multiple email addresses.
Data cleaning: the process of dealing with duplicate data issues.
Issues Related to Applications
There are a few general issues:
Timeliness: Some data starts to age as soon as it has been collected.
If the data is out of date, then so are the models and patterns that are based on it.
Relevance: The available data must contain the information necessary for the application.
Knowledge about the Data: Data sets are accompanied by documentation that describes different aspects of the data; the quality of this documentation can either aid or hinder the subsequent analysis.
Data Preprocessing
Aggregation
Sampling
Dimensionality Reduction
Feature subset selection
Feature creation
Discretization and Binarization
Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Purpose:
Data reduction: reduce the number of attributes or objects.
Change of scale: cities aggregated into regions, states, countries, etc.
More stable data: aggregated data tends to have less variability.
Aggregation
[Figure: Variation of Precipitation in Australia: the standard deviation of average monthly precipitation compared with the standard deviation of average yearly precipitation.]
Sampling
Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Sampling
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same property (of interest) as the original set of data.
Types of Sampling
Simple Random Sampling: there is an equal probability of selecting any particular item.
Sampling without replacement: as each item is selected, it is removed from the population.
Sampling with replacement: objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked more than once.
Stratified sampling: split the data into several partitions, then draw random samples from each partition.
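The three schemes above can be sketched with the standard library. The population and strata below are made up for illustration.

```python
import random

random.seed(42)  # reproducible sketch

population = list(range(100))

# Simple random sampling without replacement: items cannot repeat.
without = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling (hypothetical strata): partition the data,
# then draw a random sample from each partition.
strata = {"low": list(range(0, 50)), "high": list(range(50, 100))}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]
```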
Sample Size
[Figure: the same data set shown with 8000 points, 2000 points, and 500 points.]
Curse of Dimensionality
When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
[Figure: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, as dimensionality grows.]
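The experiment described above can be sketched as follows (a smaller point count is used here to keep the pure-Python version fast). As the dimensionality grows, the spread between the largest and smallest pairwise distance shrinks relative to the smallest, so "nearest" and "farthest" become less distinguishable.

```python
import math
import random

# Sketch of the experiment: random points in the unit hypercube; measure
# (max pairwise distance - min pairwise distance) / min pairwise distance.
random.seed(0)

def distance_spread(n_points, dim):
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points) for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

spread_low = distance_spread(100, 2)    # low-dimensional: large relative spread
spread_high = distance_spread(100, 50)  # high-dimensional: distances concentrate
```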
Dimensionality Reduction
Purpose:
Avoid the curse of dimensionality.
Reduce the amount of time and memory required by data mining algorithms.
Allow data to be more easily visualized.
May help to eliminate irrelevant features or reduce noise.
Techniques:
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
The goal is to find a projection that captures the largest amount of variation in the data.
[Figure: data points in the (x1, x2) plane with the principal eigenvector e.]
Dimensionality Reduction: ISOMAP
Construct a neighbourhood graph.
For each pair of points in the graph, compute the shortest-path distances; these approximate geodesic distances.
By Tenenbaum, de Silva, and Langford (2000).
Linear Algebra Techniques for Dimensionality Reduction
Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that:
1. are linear combinations of the original attributes,
2. are orthogonal (perpendicular) to each other, and
3. capture the maximum amount of variation in the data.
Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction.
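A minimal two-dimensional PCA can be sketched in pure Python (synthetic data, analytic eigenvalues of the 2x2 covariance matrix; a real implementation would use a linear algebra library). The first principal component of strongly correlated data captures almost all of the variance.

```python
import math
import random

# Minimal 2-D PCA sketch (hypothetical data): compute the covariance
# matrix and take the larger eigenvalue's share of the total variance.
random.seed(1)
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [2 * x + random.gauss(0, 0.2) for x in xs]  # strongly correlated with xs

def covariance(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

a, b, c = covariance(xs, xs), covariance(xs, ys), covariance(ys, ys)

# Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]].
mid, half = (a + c) / 2, math.hypot((a - c) / 2, b)
lam1, lam2 = mid + half, mid - half

# Fraction of total variance captured by the first principal component.
explained = lam1 / (lam1 + lam2)
```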
Feature Subset Selection
Another way to reduce the dimensionality of data.
Redundant features duplicate much or all of the information contained in one or more other attributes.
Example: the purchase price of a product and the amount of sales tax paid.
Irrelevant features contain no information that is useful for the data mining task at hand.
Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
Feature Subset Selection contd.
Techniques:
Brute-force approach: try all possible feature subsets as input to the data mining algorithm.
Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
Filter approaches: features are selected before the data mining algorithm is run.
Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.
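One simple filter approach can be sketched as follows (the data, feature names, and choice of correlation as the scoring criterion are all illustrative assumptions): rank features by their absolute Pearson correlation with the target and keep the top k, before any mining algorithm runs.

```python
import math

# Sketch of a filter approach (hypothetical data): score each feature by
# |correlation with the target| and keep the two best.

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "f_relevant": [1.1, 2.0, 2.9, 4.2, 5.0],   # tracks the target
    "f_inverse": [5.0, 4.1, 3.0, 2.0, 0.9],    # anti-correlated, still useful
    "f_noise": [3.0, 1.0, 4.0, 1.0, 5.0],      # weak relationship
}

scores = {name: abs(pearson(col, target)) for name, col in features.items()}
selected = sorted(scores, key=scores.get, reverse=True)[:2]
```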
An Architecture for Feature Subset Selection
As the number of subsets can be enormous and it is impractical to examine them all, some sort of stopping criterion is necessary.
Once a subset of features has been selected, the results of the target data mining algorithm on the selected subset should be validated.
Another approach to validation is to use a number of different feature selection algorithms to obtain subsets of features and then compare the results of running the data mining algorithm on each subset.
Feature Weighting: an alternative to keeping or eliminating features. More important features are assigned a higher weight, while less important features are given a lower weight.
Feature Creation: it is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively.
Feature Extraction: the creation of a new set of features from the original raw data is known as feature extraction.
e.g.: extracting facial features of humans from a set of photographs.
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
Feature Extraction: domain-specific.
Mapping Data to New Space: a totally different view of the data can reveal important and interesting features.
Feature Construction: sometimes the features in the original data sets have the necessary information, but it is not in a form suitable for the data mining algorithm; e.g., combining features.
Discretization Using Class Labels
It is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization).
Entropy-based approach:
[Figure: discretization of two attributes into 3 categories for both x and y, and into 5 categories for both x and y.]
Discretization Without Using Class Labels
[Figure: the original data discretized by equal interval width, equal frequency, and K-means.]
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
Simple functions: x^k, log(x), e^x, |x|
Standardization and normalization.
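Two common transformations can be sketched as follows (the values are made up): standardization rescales an attribute to zero mean and unit standard deviation, and min-max normalization rescales it to [0, 1].

```python
import math

# Sketch (hypothetical values): standardization and min-max normalization.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# z-score: zero mean, unit standard deviation.
standardized = [(v - mean) / std for v in values]

# min-max: smallest value maps to 0, largest to 1.
normalized = [(v - min(values)) / (max(values) - min(values)) for v in values]
```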
Similarity and Dissimilarity
Similarity
A numerical measure of how alike two data objects are.
It is higher when objects are more alike.
It often falls in the range [0, 1].
Dissimilarity
A numerical measure of how different two data objects are.
It is lower when objects are more alike.
The minimum dissimilarity is often 0; the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Nominal: d = 0 if p = q, d = 1 if p ≠ q; s = 1 if p = q, s = 0 if p ≠ q.
Ordinal: d = |p − q| / (n − 1), where the values are mapped to integers 0 to n − 1; s = 1 − d.
Interval or Ratio: d = |p − q|; s = −d, s = 1 / (1 + d), or another decreasing function of d.
Euclidean Distance

dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
Euclidean Distance
[Figure: the four points plotted in the plane.]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix:
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
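The distance matrix above can be recomputed directly from the point coordinates:

```python
import math

# Recompute the Euclidean distance matrix for the four example points.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

matrix = {
    (a, b): round(euclidean(pa, pb), 3)
    for a, pa in points.items() for b, pb in points.items()
}
```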
Minkowski Distance
Minkowski distance is a generalization of Euclidean distance:

dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1 (r = 1)   p1     p2     p3     p4
p1           0      4      4      6
p2           4      0      2      4
p3           4      2      0      2
p4           6      4      2      0

L2 (r = 2)   p1     p2     p3     p4
p1           0      2.828  3.162  5.099
p2           2.828  0      1.414  3.162
p3           3.162  1.414  0      2
p4           5.099  3.162  2      0

L∞ (r → ∞)   p1  p2  p3  p4
p1           0   2   3   5
p2           2   0   1   3
p3           3   1   0   2
p4           5   3   2   0
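The entries in these matrices follow from the general formula, with r = 1 giving the L1 (Manhattan) distance, r = 2 the L2 (Euclidean) distance, and the r → ∞ limit the L∞ (max, or Chebyshev) distance:

```python
# Minkowski distance for the example points, at different values of r.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(p, q, r):
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1.0 / r)

def chebyshev(p, q):
    # The r -> infinity limit: the maximum per-attribute difference.
    return max(abs(pk - qk) for pk, qk in zip(p, q))

l1 = minkowski(points["p1"], points["p2"], 1)             # L1 entry for (p1, p2)
l2 = round(minkowski(points["p1"], points["p4"], 2), 3)   # L2 entry for (p1, p4)
linf = chebyshev(points["p1"], points["p2"])              # L-infinity entry
```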
Mahalanobis Distance
mahalanobis(p, q) = (p − q) Σ⁻¹ (p − q)ᵀ

where Σ is the covariance matrix of the input data X:

Σ_{j,k} = (1 / (n − 1)) Σ_{i=1}^{n} (X_{ij} − X̄_j)(X_{ik} − X̄_k)

For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
Mahalanobis Distance
Covariance matrix:

Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

A = (0.5, 0.5)
B = (0, 1)
C = (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
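The example can be verified in pure Python for the 2x2 case, using the formula as given on the slide (which omits the square root):

```python
# Verify the Mahalanobis example: mahalanobis(p, q) = (p - q) Sigma^-1 (p - q)^T.
sigma = [[0.3, 0.2], [0.2, 0.3]]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis(p, q, cov):
    inv = inverse_2x2(cov)
    d = [p[0] - q[0], p[1] - q[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
mahal_ab = mahalanobis(A, B, sigma)  # 5, as on the slide
mahal_ac = mahalanobis(A, C, sigma)  # 4, as on the slide
```

Note that B is closer to A than C under Euclidean distance, but farther under Mahalanobis distance, because the covariance stretches the space along the direction in which the data varies most.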
Common Properties of a Distance
Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is a metric.
Common Properties of a Similarity
Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
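The worked example can be recomputed directly:

```python
# Recompute the SMC and Jaccard example above.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # counts 0-0 matches
jaccard = m11 / (m01 + m10 + m11)            # ignores 0-0 matches
```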
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 0.3150
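The same example in code:

```python
import math

# Recompute the cosine-similarity example above.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42)
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6)
cosine = dot / (norm1 * norm2)
```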
Visually Evaluating Correlation
[Figure: scatter plots showing similarity (correlation) values ranging from −1 to 1.]
General Approach for Combining Similarities
Sometimes attributes are of many different
types, but an overall similarity is needed.
Using Weights to Combine Similarities
We may not want to treat all attributes the same.
Use weights w_k which are between 0 and 1 and sum to 1.
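The weighted combination can be sketched as follows (the per-attribute similarities and weights below are hypothetical):

```python
# Sketch: combine per-attribute similarities s_k using weights w_k
# that are between 0 and 1 and sum to 1.
def weighted_similarity(sims, weights):
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, sims))

# Hypothetical per-attribute similarities for one pair of objects.
sims = [1.0, 0.5, 0.0]
weights = [0.5, 0.3, 0.2]
overall = weighted_similarity(sims, weights)
```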
Euclidean Density: Cell-based
The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains.
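The cell-based approach can be sketched as follows (the points and the unit cell size are made up for illustration):

```python
from collections import Counter

# Sketch (hypothetical points): divide the region into unit cells and
# define density as the number of points each cell contains.
points = [(0.1, 0.2), (0.4, 0.9), (0.5, 0.5), (1.2, 0.3), (2.7, 2.9)]

def cell_of(point, cell_size=1.0):
    # Map a point to the integer coordinates of its grid cell.
    return (int(point[0] // cell_size), int(point[1] // cell_size))

density = Counter(cell_of(p) for p in points)
```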
Euclidean Density: Center-based
Euclidean density is the number of points within a
specified radius of the point
THANK YOU