BITS WASE DM Session 2

Uploaded by satyanarayan-reddy-k, posted 02-Jun-2018


8/10/2019 Bits Wase DM Session 2 (82 slides)
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

    Data Mining: Data

    Lecture Notes for Session 2

    Introduction to Data Mining

by Tan, Steinbach, Kumar

    By Dr. K. Satyanarayan Reddy


    What is Data?

Data is a collection of data objects and their attributes.

An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, feature, or dimension.

A collection of attributes describes an object.
An object is also known as a record, point, case, vector, pattern, sample, entity, observation, or instance.

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
 10 | No     | Single         | 90K            | Yes

(The columns are the Attributes; the rows are the Objects.)
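A record data set like the table above can be sketched in Python as a list of dictionaries, one per data object, with attribute names as keys. This is an illustration, not part of the original slides; only the first three rows are shown.

```python
# Each data object (row) is a record; each key is an attribute.
# Values are taken from the table above (income in dollars).
records = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single",
     "Taxable Income": 125_000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married",
     "Taxable Income": 100_000, "Cheat": "No"},
    {"Tid": 5, "Refund": "No", "Marital Status": "Divorced",
     "Taxable Income": 95_000, "Cheat": "Yes"},
]

# A collection of attributes describes an object:
attributes = sorted(records[0].keys())
print(attributes)
```

Each dictionary is one object; the shared key set is the attribute set.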


    Data

Definition 1: An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

Definition 2: A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

The process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object.


    Attribute Values

Attribute values are numbers or symbols assigned to an attribute.

Distinction between attributes and attribute values:

The same attribute can be mapped to different attribute values.
Example: height can be measured in feet or meters.

Different attributes can be mapped to the same set of values.
Example: attribute values for ID and age are integers.
But the properties of the attribute values can be different:
ID has no limit, but age has a maximum and minimum value.


    Measurement of Length

The way you measure an attribute may not match the attribute's properties.

(Figure: objects A-E whose lengths are mapped to numbers under two different measurement scales.)


    Types of Attributes

    There are different types of attributes

    Nominal

    Examples: ID numbers, eye color, zip codes

    Ordinal

    Examples: rankings (e.g., taste of potato chips on a scale

    from 1-10), grades, height in {tall, medium, short}

    Interval

    Examples: calendar dates, temperatures in Celsius or

    Fahrenheit.

    Ratio

    Examples: temperature in Kelvin, length, time, counts


    Properties of Attribute Values

    The type of an attribute depends on which of the

    following properties it possesses:

Distinctness: =, ≠
Order: <, ≤, >, ≥
Addition: +, −
Multiplication: *, /

Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties


    Types of Attributes

Nominal and ordinal attributes are collectively referred to as Categorical or Qualitative attributes.

As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers.

The remaining two types of attributes, Interval and Ratio, are collectively referred to as Quantitative or Numeric attributes.

Quantitative attributes are represented by numbers and have most of the properties of numbers.

Note: quantitative attributes can be integer-valued or continuous.


Attribute Type | Description | Examples | Operations

Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test

Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests

Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation


Attribute Level | Transformation | Comments

Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?

Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval | new_value = a * old_value + b, where a and b are constants | Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Ratio | new_value = a * old_value | Length can be measured in meters or feet.


    Discrete and Continuous Attributes

Discrete Attribute
Has only a finite or countably infinite set of values.
Examples: zip codes, counts, or the set of words in a collection of documents.
Often represented as integer variables.
Note: binary attributes are a special case of discrete attributes.

Continuous Attribute
Has real numbers as attribute values.
Examples: temperature, height, or weight.
Practically, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.

Asymmetric Attributes: binary attributes where only non-zero values are important are called asymmetric binary attributes.
Example: if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.


    Important Characteristics of Structured Data

Dimensionality: the dimensionality of a data set is the number of attributes that the objects in the data set possess.

Curse of Dimensionality: data with a small number of dimensions tends to be qualitatively different from moderate or high-dimensional data.
The difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality.

Sparsity
Only presence counts: for some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases fewer than 1% of the entries are non-zero.
In practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated. This results in significant savings with respect to computation time and storage.

Resolution
Patterns depend on the scale.
It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions.
Example: the surface of the Earth seems very uneven at a resolution of a few meters, but is relatively smooth at a resolution of tens of kilometers.


    Record Data

Data that consists of a collection of records (data objects), each of which consists of a fixed set of data fields (attributes).

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
 10 | No     | Single         | 90K            | Yes


    Transaction or Market Basket Data

    Transaction Data is a special type of record data, where each record (transaction) involves

    a set of items.

    Consider a grocery store where a set of products purchased by a customer during one

    shopping trip constitutes a Transaction, while the individual products that were

    purchased are the Items.

    This type of data is called Market Basket Data because the items in each record are the

    products in a person's "market basket."

    Transaction data is a collection of sets of items, but it can

    be viewed as a set of records whose fields are asymmetric

    attributes.

Often, the attributes are binary, indicating whether or not an item was purchased, but the attributes can be discrete or continuous, such as the number of items purchased or the amount spent on those items.


    The Sparse Data Matrix

A Sparse Data Matrix is a special case of a Data Matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important.

Transaction data is an example of a sparse data matrix that has only 0-1 entries.

    A common example is Document Data.


    Document Data

Each document becomes a 'term' vector:
each term is a component (attribute) of the vector;
the value of each component is the number of times the corresponding term occurs in the document.

             team coach play ball score game win lost timeout season
Document 1:   3    0     5    0    2     6    0    2    0       2
Document 2:   0    7     0    2    1     0    0    3    0       0
Document 3:   0    1     0    0    1     2    2    0    3       0
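A term vector like the rows above can be sketched in Python with a fixed vocabulary and a simple word count. This is an illustration with naive whitespace tokenization; the sample document is made up and is not from the slides.

```python
from collections import Counter

# Build a term vector: component k counts occurrences of vocabulary
# term k in the document.
vocabulary = ["season", "timeout", "lost", "win", "game",
              "score", "ball", "play", "coach", "team"]

def term_vector(text: str) -> list[int]:
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

doc = "team play game game score win play lost"
print(term_vector(doc))  # [0, 0, 1, 1, 2, 1, 0, 2, 0, 1]
```

Components for terms absent from the document are 0, which is why document data is usually stored as a sparse matrix.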


    Transaction Data

    A special type of record data, where

    each record (transaction) involves a set of items.

For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

TID | Items
 1  | Bread, Coke, Milk
 2  | Beer, Bread
 3  | Beer, Coke, Diaper, Milk
 4  | Beer, Bread, Diaper, Milk
 5  | Coke, Diaper, Milk
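The transactions above can be viewed as records with asymmetric binary attributes by encoding each item as 1 (purchased) or 0 (not purchased). A minimal sketch, using the table's data:

```python
# Market basket data from the table above, encoded as asymmetric
# binary attributes: 1 = item purchased, 0 = not purchased.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

items = sorted(set().union(*transactions.values()))
matrix = {tid: [int(item in basket) for item in items]
          for tid, basket in transactions.items()}

print(items)      # ['Beer', 'Bread', 'Coke', 'Diaper', 'Milk']
print(matrix[1])  # [0, 1, 1, 0, 1]
```

Because only the 1 entries matter, such a matrix is normally stored sparsely, as the later Sparse Data Matrix slide notes.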


    Graph Data contd

Examples: a generic graph and HTML links.

(Figures: a generic graph with weighted edges, and a set of linked Web pages on topics such as Data Mining, Graph Partitioning, Parallel Solution of Sparse Linear Systems of Equations, and N-Body Computation and Dense Linear System Solvers.)


    Chemical Data

    Benzene Molecule: C6H6


    Ordered Data

Sequences of Transactions: for some types of data, the attributes have relationships that involve order in time or space.

(Figure: a sequence of transactions; each element of the sequence is a set of items/events.)


Sequence Data

Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence.

    GGTTCCGCCTTCAGCCCCGCGCC

    CGCAGGGCCCGCCCCGCGCCGTC

    GAGAAGGGCCCGCCTGGCGGGCG

    GGGGGAGGCGGGGCCGCCCGAGC

    CCAACCGAGTCCGACCAGGTGCC

    CCCTCTGCTCGGCCTAGACCTGAGCTCATTAGGCGGCAGCGGACAG

    GCCAAGTAGAACACGCGAAGCGC

    TGGGCTGCCTGCTGCGACCAGGG

Example: genomic sequence data. The genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.

Many of the problems associated with genetic sequence data involve predicting similarities in the structure and function of genes from similarities in nucleotide sequences.


    Ordered Data

    Spatial Data: Some objects have spatial attributes, such as positions or areas,

    as well as other types of attributes.

    e.g.: Weather data (precipitation, temperature, pressure) that is collected for a

    variety of geographical locations.

    An important aspect of spatial data is spatial autocorrelation; i.e., objects that

    are physically close tend to be similar in other ways as well. Thus, two points

    on the Earth that are close to each other usually have similar values for

    temperature and rainfall.

Spatio-Temporal Data: e.g., the average monthly temperature of land and ocean.


    Data Quality

Data Mining focuses on the detection and correction of data quality problems, and on the use of algorithms that can tolerate poor data quality.

The first step, detection and correction, is often called Data Cleaning.

    What kinds of data quality problems?

    How can we detect problems with the data?

    What can we do about these problems?

    Examples of data quality problems:

    Noise and outliers

    missing values

    duplicate data


    Issues with Data

Measurement and Data Collection Issues: perfect data can't be expected. There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.

    Values or even entire data objects may be missing or there may be spurious or duplicate

    objects; i.e., multiple data objects that all correspond to a single "real" object.

    e.g.: There might be two different records for a person who has recently lived at two

    different addresses.

Even if all the data is present and "looks fine," there may be inconsistencies; for example, a person has a height of 2 meters but weighs only 2 kilograms.

Measurement and Data Collection Errors: the term measurement error refers to any problem resulting from the measurement process; for example, the value recorded may differ from the true value.

    For continuous attributes, the numerical difference of the measured and true value is

    called the error.

    The term data collection error refers to errors such as omitting data objects or attribute

    values, or inappropriately including a data object.

    e.g.: A study of animals of a certain species might include animals of a related species

    that are similar in appearance to the species of interest. Both measurement errors and

    data collection errors can be either systematic or random.


    Noise

Noise refers to the modification of original values.

Examples: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen.

(Figures: two sine waves; the same two sine waves with noise added.)


    Precision, Bias and Accuracy

Precision: the closeness of repeated measurements (of the same quantity) to one another.

Bias: a systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured.

Accuracy: the closeness of measurements to the true value of the quantity being measured.

Accuracy depends on precision and bias, but there is no specific formula for accuracy in terms of these two quantities.

One important aspect of accuracy is the number of significant digits used to report a measurement.


    Outliers

Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set.


    Missing Values

Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight).
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

Handling missing values:
Eliminate data objects.
Estimate missing values.
Ignore the missing value during analysis.
Replace with all possible values (weighted by their probabilities).

Inconsistent Values: data can contain inconsistent values. Consider an address field where both a zip code and a city are listed, but the specified zip code area is not contained in that city.
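Two of the handling strategies above, eliminating data objects and estimating missing values, can be sketched in a few lines of Python. The data is made up for illustration; None marks a missing entry.

```python
# Ages with missing entries (None); data is made up for illustration.
ages = [23, None, 31, None, 27]

# Strategy 1: eliminate data objects with missing values.
complete = [a for a in ages if a is not None]

# Strategy 2: estimate (impute) the missing value, here with the
# mean of the observed values.
mean_age = sum(complete) / len(complete)
imputed = [a if a is not None else mean_age for a in ages]
print(imputed)  # [23, 27.0, 31, 27.0, 27]
```

Elimination is safe only when few objects are affected; imputation keeps all objects but introduces estimated values.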


    Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another.
This is a major issue when merging data from heterogeneous sources.

Example: the same person with multiple email addresses.

Data cleaning: the process of dealing with duplicate data issues.


    Issues Related to Applications

There are a few general issues:

Timeliness: some data starts to age as soon as it has been collected. If the data is out of date, then so are the models and patterns that are based on it.

Relevance: the available data must contain the information necessary for the application.

Knowledge about the Data: data sets are ideally accompanied by documentation that describes different aspects of the data; the quality of this documentation can either aid or hinder the subsequent analysis.


    Data Preprocessing

    Aggregation

    Sampling

    Dimensionality Reduction

    Feature subset selection

    Feature creation

    Discretization and Binarization

    Attribute Transformation


    Aggregation

    Combining two or more attributes (or objects) into

    a single attribute (or object)

Purpose:
Data reduction: reduce the number of attributes or objects.
Change of scale: cities aggregated into regions, states, countries, etc.
More stable data: aggregated data tends to have less variability.
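A change of scale by aggregation can be sketched as summing monthly values into yearly totals per object. The city names and precipitation figures below are made up for illustration.

```python
# Aggregating monthly values to a yearly value changes the scale of
# the data (months -> years) and reduces the number of objects.
monthly_precip = {
    ("Sydney", "Jan"): 102.0, ("Sydney", "Feb"): 117.0,
    ("Sydney", "Mar"): 129.0,
    ("Perth", "Jan"): 15.0, ("Perth", "Feb"): 9.0,
    ("Perth", "Mar"): 20.0,
}

yearly = {}
for (city, _month), value in monthly_precip.items():
    yearly[city] = yearly.get(city, 0.0) + value

print(yearly)  # {'Sydney': 348.0, 'Perth': 44.0}
```

Six monthly records collapse into two aggregated ones, illustrating the data-reduction and change-of-scale purposes listed above.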


    Aggregation

(Figures: Variation of Precipitation in Australia; standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation.)


    Sampling

Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis.

Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.

Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.


    Sampling

The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.

A sample is representative if it has approximately the same property (of interest) as the original set of data.


    Types of Sampling

Simple Random Sampling
There is an equal probability of selecting any particular item.

Sampling without replacement
As each item is selected, it is removed from the population.

Sampling with replacement
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked more than once.

Stratified sampling
Split the data into several partitions; then draw random samples from each partition.
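The three sampling schemes above can be sketched with the standard library's random module. The population and strata below are made up for illustration; the seed is fixed only to make the sketch reproducible.

```python
import random

population = list(range(100))
random.seed(0)  # fixed seed, for a reproducible sketch

# Simple random sampling without replacement: each item is
# selected at most once.
without = random.sample(population, 10)

# Sampling with replacement: the same object can be picked
# more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split into partitions (strata), then draw a
# random sample from each partition.
strata = {"low": [x for x in population if x < 50],
          "high": [x for x in population if x >= 50]}
stratified = [x for group in strata.values()
              for x in random.sample(group, 5)]

print(len(without), len(with_repl), len(stratified))  # 10 10 10
```

Stratified sampling guarantees each partition is represented, which simple random sampling does not.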


    Sample Size

(Figure: the same data set sampled with 8000 points, 2000 points, and 500 points.)


    Curse of Dimensionality

When dimensionality increases, data becomes increasingly sparse in the space that it occupies.

Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.

(Figure: randomly generate 500 points and compute the difference between the maximum and minimum distance between any pair of points, as a function of the number of dimensions.)


    Dimensionality Reduction

Purpose:
Avoid the curse of dimensionality.
Reduce the amount of time and memory required by data mining algorithms.
Allow data to be more easily visualized.
May help to eliminate irrelevant features or reduce noise.

Techniques:
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques


    Dimensionality Reduction: PCA

The goal is to find a projection that captures the largest amount of variation in the data.

(Figure: data points in the x1-x2 plane; the eigenvector e points along the direction of largest variation.)


    Dimensionality Reduction: ISOMAP

Construct a neighbourhood graph.
For each pair of points in the graph, compute the shortest-path distances; these approximate the geodesic distances.

By: Tenenbaum, de Silva, Langford (2000)


    Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 51

    Linear Algebra Techniques for Dimensionality Reduction

Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that:

1. are linear combinations of the original attributes,
2. are orthogonal (perpendicular) to each other, and
3. capture the maximum amount of variation in the data.

Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction.
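For two-dimensional data the first principal component can be computed in closed form: it is the eigenvector of the 2x2 covariance matrix with the largest eigenvalue, whose angle is 0.5 * atan2(2*sxy, sxx - syy). A minimal sketch with made-up sample points (not from the slides):

```python
import math

# PCA sketch for 2-D data: the first principal component is the
# direction of maximum variance.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

# Angle of the principal axis of a 2x2 symmetric matrix.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))  # unit-length component

# Variance of the data projected onto pc1 (the largest eigenvalue);
# it is at least as large as the variance along either original axis.
proj = [(x - mx) * pc1[0] + (y - my) * pc1[1] for x, y in points]
var_pc1 = sum(p ** 2 for p in proj) / (n - 1)
print(round(var_pc1, 4))
```

Projecting onto pc1 alone reduces the data from two dimensions to one while keeping the maximum possible variance; real implementations use a linear algebra library rather than this 2-D closed form.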


    Feature Subset Selection

Feature subset selection is another way to reduce the dimensionality of data.

Redundant features
duplicate much or all of the information contained in one or more other attributes.
Example: the purchase price of a product and the amount of sales tax paid.

Irrelevant features
contain no information that is useful for the data mining task at hand.
Example: a student's ID is often irrelevant to the task of predicting the student's GPA.


Feature Subset Selection contd.

Techniques:

Brute-force approach: try all possible feature subsets as input to the data mining algorithm.

Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.

Filter approaches: features are selected before the data mining algorithm is run.

Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.


    An Architecture for Feature Subset Selection

As the number of subsets can be enormous and it is impractical to examine them all, some sort of stopping criterion is necessary.

Once a subset of features has been selected, the results of the target data mining algorithm on the selected subset should be validated.
Another approach to validation is to use a number of different feature selection algorithms to obtain subsets of features and then compare the results of running the data mining algorithm on each subset.

Feature Weighting: an alternative to keeping or eliminating features. More important features are assigned a higher weight, while less important features are given a lower weight.

Feature Creation: it is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively.

Feature Extraction: the creation of a new set of features from the original raw data is known as feature extraction.
e.g.: extracting facial features from a set of photographs.


    Feature Creation

Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.

Three general methodologies:

Feature Extraction: domain-specific.

Mapping Data to a New Space: a totally different view of the data can reveal important and interesting features.

Feature Construction: sometimes the features in the original data sets have the necessary information, but it is not in a form suitable for the data mining algorithm; e.g., combining features.


    Discretization Using Class Labels

It is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization).

Entropy-based approach:

(Figures: discretization into 3 categories for both x and y; discretization into 5 categories for both x and y.)


    Discretization Without Using Class Labels

(Figures: the original data, and discretizations by equal interval width, equal frequency, and K-means.)
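Two of the unsupervised schemes named above, equal interval width and equal frequency, can be sketched without class labels as follows. The values are made up for illustration; each function returns a bin index in 0..k-1 per value.

```python
# Unsupervised discretization: assign each value a bin index 0..k-1.
values = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]

def equal_width(vals, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(vals), max(vals)
    width = (hi - lo) / k
    # The maximum value falls exactly on the upper edge; clamp it
    # into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in vals]

def equal_frequency(vals, k):
    """Put (approximately) the same number of values in each bin."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    bins = [0] * len(vals)
    per_bin = len(vals) // k
    for rank, i in enumerate(order):
        bins[i] = min(rank // per_bin, k - 1)
    return bins

print(equal_width(values, 3))      # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
print(equal_frequency(values, 3))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

On this evenly clustered sample the two schemes agree; on skewed data, equal width can leave some bins almost empty while equal frequency cannot.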


    Attribute Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.

Simple functions: x^k, log(x), e^x, |x|

Standardization and Normalization
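Standardization is one such attribute transformation: the z-score maps each value x to (x - mean) / stdev, giving the transformed attribute mean 0 and standard deviation 1. A sketch with made-up heights:

```python
import statistics

# Attribute transformation by standardization (z-score).
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

mu = statistics.mean(heights_cm)     # 170.0
sigma = statistics.stdev(heights_cm)  # sample standard deviation

z = [(x - mu) / sigma for x in heights_cm]
print([round(v, 3) for v in z])  # [-1.265, -0.632, 0.0, 0.632, 1.265]
```

After the transformation the attribute is unit-free, which matters when attributes on different scales are compared or combined (e.g., in distance computations).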


    Similarity and Dissimilarity

Similarity
Numerical measure of how alike two data objects are.
Is higher when objects are more alike.
Often falls in the range [0, 1].

Dissimilarity
Numerical measure of how different two data objects are.
Lower when objects are more alike.
Minimum dissimilarity is often 0; the upper limit varies.

Proximity refers to either a similarity or a dissimilarity.


    Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.


    Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 62

    Euclidean Distance


dist(p, q) = sqrt( Σ_{k=1..n} (p_k − q_k)² )

where n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.


    Euclidean Distance

(Figure: the four points p1-p4 plotted in the plane.)

    point x y

    p1 0 2

    p2 2 0

    p3 3 1

    p4 5 1

    Distance Matrix

    p1 p2 p3 p4

    p1 0 2.828 3.162 5.099

    p2 2.828 0 1.414 3.162

    p3 3.162 1.414 0 2

    p4 5.099 3.162 2 0
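The distance matrix above can be reproduced directly from the formula; a short sketch:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
names = ["p1", "p2", "p3", "p4"]
# Pairwise distances, rounded to match the table above.
matrix = [[round(euclidean(points[a], points[b]), 3) for b in names]
          for a in names]
for name, row in zip(names, matrix):
    print(name, row)
```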


    Minkowski Distance

Minkowski distance is a generalization of Euclidean distance:

dist(p, q) = ( Σ_{k=1}^{n} |p_k - q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Common cases: r = 1 (city block, Manhattan, L1 norm), r = 2 (Euclidean, L2 norm), and r → ∞ (supremum, L∞ norm).


    Minkowski Distance

    Distance Matrix

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1  p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0

    L2 p1 p2 p3 p4

    p1 0 2.828 3.162 5.099

    p2 2.828 0 1.414 3.162

    p3 3.162 1.414 0 2

    p4 5.099 3.162 2 0

L∞  p1  p2  p3  p4

    p1 0 2 3 5

    p2 2 0 1 3

    p3 3 1 0 2

    p4 5 3 2 0
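The three matrices above correspond to r = 1, r = 2, and r → ∞; a sketch that reproduces individual entries:

```python
import math

def minkowski(p, q, r):
    """Minkowski distance; r = float('inf') gives the supremum (L-infinity) distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        # Limit r -> infinity: the largest per-attribute difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))             # L1 (city block): 4.0
print(minkowski(p1, p2, 2))             # L2 (Euclidean): 2.828...
print(minkowski(p1, p2, float("inf")))  # L-infinity: 2
```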

    Mahalanobis Distance


mahalanobis(p, q) = (p - q) Σ^(-1) (p - q)^T

where Σ is the covariance matrix of the input data X:

Σ_{j,k} = (1 / (n - 1)) Σ_{i=1}^{n} (X_{ij} - X̄_j)(X_{ik} - X̄_k)

(Figure: for the red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.)


    Mahalanobis Distance

Covariance matrix:

Σ = | 0.3  0.2 |
    | 0.2  0.3 |

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
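The values Mahal(A, B) = 5 and Mahal(A, C) = 4 can be checked by hand; a pure-Python sketch for the 2-D case (note that the formula as written yields the squared distance, which is what these values are):

```python
def mahalanobis_sq(p, q, cov):
    """Squared Mahalanobis distance (p - q) Sigma^-1 (p - q)^T for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of the 2x2 covariance
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Row vector times inverse matrix times column vector.
    return (dx * inv[0][0] + dy * inv[1][0]) * dx + (dx * inv[0][1] + dy * inv[1][1]) * dy

cov = [[0.3, 0.2], [0.2, 0.3]]
A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis_sq(A, B, cov), 6))  # 5.0
print(round(mahalanobis_sq(A, C, cov), 6))  # 4.0
```

B is closer to A in Euclidean terms, yet farther in Mahalanobis terms, because the A-B direction runs against the data's correlation.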


    Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

A distance that satisfies these properties is a metric.

    Common Properties of a Similarity


Similarities also have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects) p and q.

    Similarity Between Binary Vectors


A common situation is that objects p and q have only binary attributes.

Compute similarities using the following quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1

Simple Matching and Jaccard coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

    SMC versus Jaccard: Example


p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
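The worked example above, sketched in code (the helper name is illustrative):

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 1))
    m00 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 0))
    m10 = sum(1 for a, b in zip(p, q) if (a, b) == (1, 0))
    m01 = sum(1 for a, b in zip(p, q) if (a, b) == (0, 1))
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    # Jaccard ignores 0-0 matches; guard against all-zero vectors.
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```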

    Cosine Similarity


If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

where • indicates the vector dot product and ||d|| is the length of vector d.

Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
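The same computation in code:

```python
import math

def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```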


    Visually Evaluating Correlation


Scatter plots showing the similarity (correlation) from -1 to 1.
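The quantity these scatter plots visualize is the (Pearson) correlation; the correlation slides themselves did not survive extraction, so this is a sketch of the standard definition rather than the deck's own code:

```python
import math

def correlation(x, y):
    """Pearson correlation: covariance(x, y) / (std(x) * std(y)); lies in [-1, 1]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(correlation([1, 2, 3], [2, 4, 6]), 6))  # 1.0  (perfectly linear, increasing)
print(round(correlation([1, 2, 3], [6, 4, 2]), 6))  # -1.0 (perfectly linear, decreasing)
```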

    General Approach for Combining Similarities


Sometimes attributes are of many different types, but an overall similarity is needed.

    Using Weights to Combine Similarities


We may not want to treat all attributes the same.

Use weights w_k that are between 0 and 1 and sum to 1.
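A sketch of the weighted combination: per-attribute similarities s_k are averaged with weights w_k, skipping attributes that cannot be compared (the use of None as the "not comparable" marker is an illustrative convention, not from the slides):

```python
def combined_similarity(sims, weights):
    """Weighted combination of per-attribute similarities.
    A similarity of None marks an attribute that cannot be compared
    (e.g. an asymmetric binary attribute where both objects are 0)."""
    num = sum(w * s for w, s in zip(weights, sims) if s is not None)
    den = sum(w for w, s in zip(weights, sims) if s is not None)
    return num / den

sims = [1.0, 0.5, None, 0.0]     # per-attribute similarities s_k
weights = [0.4, 0.3, 0.2, 0.1]   # weights w_k: between 0 and 1, summing to 1
print(round(combined_similarity(sims, weights), 4))  # 0.6875
```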


Euclidean Density – Cell-based


The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points each cell contains.
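A minimal sketch of cell-based density for 2-D points (the grid and sample points are illustrative):

```python
from collections import Counter

def cell_density(points, cell_size):
    """Count the points falling in each square grid cell of the given side length."""
    counts = Counter()
    for x, y in points:
        # Integer division maps each coordinate to its cell index.
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

pts = [(0.2, 0.3), (0.4, 0.1), (0.9, 0.8), (1.5, 1.5)]
print(cell_density(pts, 1.0))  # cell (0, 0) holds 3 points, cell (1, 1) holds 1
```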

Euclidean Density – Center-based


Euclidean density is the number of points within a specified radius of the point.
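The center-based definition in code, for 2-D points (sample data is illustrative):

```python
import math

def center_density(points, center, radius):
    """Euclidean density at a point: the number of points within the radius."""
    cx, cy = center
    return sum(1 for x, y in points
               if math.hypot(x - cx, y - cy) <= radius)

pts = [(0, 0), (1, 0), (0, 1), (3, 3)]
print(center_density(pts, (0, 0), 1.5))  # 3 -- (3, 3) lies outside the radius
```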


    THANK YOU