introduction. 1.data mining and knowledge discovery 2.data mining methods 3.supervised learning...

Introduction


1. Data Mining and Knowledge Discovery
2. Data Mining Methods
3. Supervised Learning
4. Unsupervised Learning
5. Other Learning Paradigms
6. Introduction to Data Preprocessing

Data Mining and Knowledge Discovery

• Vast amounts of data surround us: raw data that is largely intractable for human or manual analysis.

• Data Mining (DM) is about solving problems by analyzing data present in real databases.

• DM is regarded either as a synonym of the Knowledge Discovery in Databases (KDD) process or as the main step of KDD.


• KDD definition: “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

• Stages:
  – Problem Specification.
  – Problem Understanding.
  – Data Preprocessing.
  – Data Mining.
  – Evaluation.
  – Results Exploitation.


KDD process:



Data Mining Methods


• Statistical Methods:
  – Regression Models:
    • They are used in estimation tasks and require the class of model equation to be specified in advance.
    • Linear, quadratic and logistic regression are the best known.
    • They may have problems with missing values, outliers and redundant or harmful features.
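As an illustration of the points above, the following is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the sample values are assumptions made for this example only. The second call shows how a single outlier can distort the fitted slope.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(xs, ys)

# Same inputs, but the last target value is an outlier
outlier_ys = [3.0, 5.0, 7.0, 90.0]
_, slope_with_outlier = fit_line(xs, outlier_ys)

print(intercept, slope, slope_with_outlier)
```

The linear form of the equation is chosen before fitting, which is the "class of model equation" the slide refers to.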


• Statistical Methods:
  – Artificial Neural Networks (ANNs):
    • They are powerful mathematical models suitable for almost all DM tasks, especially predictive ones.
    • Multi-layer perceptron (MLP), Radial Basis Function Networks (RBFNs) and Learning Vector Quantization (LVQ) are the best known.
    • They require numeric attributes and may have problems with missing values.
    • They are robust to outliers and noise.


• Statistical Methods:
  – Bayesian Learning:
    • It uses probability theory as a framework for making rational decisions under uncertainty.
    • Naïve Bayes is the best-known technique.
    • These methods are very sensitive to redundant or uninformative attributes and examples, as well as to noisy examples and outliers.
    • They require nominal attributes and cannot deal with missing values.
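A minimal sketch of a Naïve Bayes classifier for nominal attributes follows. The function names (`train_nb`, `predict_nb`), the toy weather data and the use of Laplace smoothing are assumptions of this example, not something prescribed by the slides.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict_nb(model, row):
    """Return the class with the highest log posterior score."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            # Laplace smoothing avoids zero probabilities for unseen values
            numerator = value_counts[(i, label)][value] + 1
            denominator = n + len(domains[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical nominal data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
pred_sunny_hot = predict_nb(model, ("sunny", "hot"))
pred_rain_cool = predict_nb(model, ("rain", "cool"))
print(pred_sunny_hot, pred_rain_cool)
```

The sketch only handles nominal values with no gaps, which matches the requirement stated on the slide.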


• Statistical Methods:
  – Instance-based Learning:
    • The examples are stored verbatim, and a distance function is used to determine which members of the database are closest to a new example for which a prediction is desired.
    • K-Nearest Neighbor (KNN) is the most representative method.
    • They are good candidates for improvement through data reduction procedures.
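The idea can be sketched as a small K-Nearest Neighbor classifier. The function name `knn_predict`, the choice of Euclidean distance and the toy training set are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (features, label); vote among the k closest examples."""
    # Sort the stored examples by Euclidean distance to the query
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical stored examples forming two well-separated groups
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
pred_near_a = knn_predict(train, (1.1, 1.0))
pred_near_b = knn_predict(train, (5.1, 5.0))
print(pred_near_a, pred_near_b)
```

Every prediction scans all stored examples, which is why reducing the stored data directly reduces the cost of this family of methods.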


• Statistical Methods:
  – Support Vector Machines (SVMs):
    • They are machine learning algorithms based on learning theory and similar to ANNs in the sense that they are used for estimation and perform very well when data is linearly separable.
    • They require numeric non-missing data and are commonly robust against noise and outliers.


• Symbolic Methods:
  – Rule Learning:
    • They search for a rule that explains some part of the data, separate these examples and recursively conquer the remaining examples.
    • AQ, CN2, RIPPER, PART and FURIA are good examples of this family.
    • They require nominal data (sometimes obtained through an implicit discretization step) and have a built-in mechanism for selecting interesting attributes from the data.


• Symbolic Methods:
  – Decision Trees:
    • They construct predictive models formed by iterations of a divide-and-conquer scheme of hierarchical decisions.
    • CART, C4.5 and PUBLIC are good examples of this family.
    • They are closely related to rule learning methods and suffer from the same disadvantages.
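One divide-and-conquer step can be sketched as choosing the attribute that best separates the classes. The example below uses information gain as the splitting criterion, which is an assumption of this sketch (C4.5 and CART use related criteria such as gain ratio and the Gini index). The attribute names and rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical nominal data set
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
# The attribute with the highest gain becomes the decision at this node
best_attribute = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(gain_outlook, gain_windy, best_attribute)
```

A full tree learner would repeat this selection recursively on each partition until the partitions are pure or a stopping rule applies.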


• Data descriptive tasks:
  – Clustering:
    • It applies when there is no class information to be predicted but the examples must be divided into natural groups or clusters.
    • Well-known examples of clustering algorithms are k-Means, COBWEB and Self Organizing Maps.
    • They prefer numeric data with no missing values and without noise or outliers.
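A minimal sketch of k-Means follows. The function name `k_means`, the fixed number of iterations and the naive initialization (the first k points act as initial centroids) are simplifying assumptions of this example; practical implementations usually initialize more carefully.

```python
import math


def k_means(points, k, iterations=10):
    """Alternate between assigning points to centroids and recomputing centroids."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, clusters


# Hypothetical numeric data with two natural groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

Because centroids are plain means, a single extreme value can pull a centroid away from its group, which is one reason these methods prefer data without noise or outliers.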


• Data descriptive tasks:
  – Association Rules:
    • A set of techniques that aim to find association relationships in the data.
    • The Apriori algorithm is the most emblematic technique for this problem.
    • Data transformation (mainly discretization) and reduction are often needed to perform high-quality analysis in this DM problem.
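The level-wise search behind Apriori can be sketched as follows. This is a simplified illustration, not an optimized implementation. The function names, the toy transactions and the support threshold are assumptions of the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset, transactions)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # Prune step: keep a candidate only if all its k-subsets are frequent
        level = [c for c in candidates
                 if all(frozenset(s) in result for s in combinations(c, len(c) - 1))
                 and support(c, transactions) >= min_support]
    return result


# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)

# Confidence of the rule butter -> bread
confidence = frequent[frozenset({"bread", "butter"})] / frequent[frozenset({"butter"})]
print(len(frequent), confidence)
```

The items here are nominal by construction; numeric attributes would first need to be discretized into such items.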



Supervised Learning

• Prediction methods are commonly referred to as supervised learning. Supervised methods attempt to discover the relationships between input attributes and a target attribute.

• A training set is given and the objective is to form a description that can be used to predict unseen examples.

• Problems:
  – Classification
    • The domain of the target attribute is finite and categorical.
    • A classifier must assign a class to an unseen example.
  – Regression
    • The target attribute takes values from an infinite (continuous) domain.
    • The aim is to fit a model that expresses the output target attribute as a function of the input attributes.
  – Time Series Analysis
    • Making predictions in time.



Unsupervised Learning

• There is no supervisor and only input data is available.

• The aim is now to find regularities, irregularities, relationships, similarities and associations in the input.

• Problems:
  – Clustering
  – Association Rules
  – Pattern Mining
    • It is adopted as a more general term than frequent pattern mining or association mining.
  – Outlier Detection
    • It is the process of finding data examples whose behaviour is very different from what is expected (outliers or anomalies).
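As a small illustration of outlier detection, the sketch below flags values that fall outside the interquartile-range fences. This rule is one common heuristic chosen for the example, not a method named in the slides. The function name `iqr_outliers` and the readings are hypothetical.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one anomalous reading
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

No class labels are involved: the "expectation" is estimated from the input data alone, which is what makes this an unsupervised task.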



Other Learning Paradigms

• Imbalanced Learning
  – A classification problem where the data has an exceptional distribution on the target attribute.
  – The number of examples representing the class of interest is much lower than that of the other classes.
• Multi-instance Learning
  – Restrictions are imposed on models because each example consists of a bag of instances instead of a single instance.


• Multi-label Classification
  – Each instance is associated not with a single class, but with a subset of the classes.
• Semi-supervised Learning
  – It is concerned with the design of models in the presence of both labeled and unlabeled data.
  – It covers semi-supervised classification and semi-supervised clustering.
  – It is related to Active Learning.

• Subgroup Discovery
  – It arises from the hybridization of classification and association mining.
  – It aims to extract interesting rules with respect to a target attribute.
• Transfer Learning
  – It aims to extract knowledge from one or more source tasks and apply that knowledge to a target task.
  – The so-called data shift problem is closely related.

• Data Stream Learning
  – When not all data is available at a given moment, it is necessary to develop learning algorithms that treat the input as a continuous data stream.
  – Each instance can be inspected only once and must then be discarded to make room for subsequent instances.
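The single-pass constraint can be illustrated with an incremental statistic. The sketch below uses Welford's online update to maintain a running mean and variance. It is not a full stream learner, and the class name `RunningStats` and the simulated stream are assumptions of this example.

```python
class RunningStats:
    """Running mean and sample variance updated one instance at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: uses only the new value and the current summary
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_readings():
    """Hypothetical stream: values are produced one by one and never stored."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for reading in sensor_readings():
    stats.update(reading)  # the instance is inspected once, then discarded

print(stats.count, stats.mean, stats.variance)
```

Memory use stays constant regardless of the stream length, since only the summary (count, mean, squared deviations) is retained.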



Introduction to Data Preprocessing

• Unfortunately, real-world databases are strongly affected by negative factors such as noise, missing values, inconsistent and superfluous data, and huge sizes in both dimensions (examples and features).

• Low-quality data will lead to low-quality DM performance.


Forms of Data Preparation


• Data Cleaning
  – Correct bad data, filter some incorrect data out of the data set and reduce the unnecessary detail of data.
• Data Transformation
  – The data is converted or consolidated so that the mining process can be applied or becomes more efficient.
• Data Integration
  – Merging of data from multiple data stores.


• Data Normalization
  – To express data in the same measurement units, scale or range.
• Missing Data Imputation
  – To fill in the variables that contain missing values with reasonable estimates.
• Noise Identification
  – To detect random errors or variances in a measured variable.
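Two of the preparation steps above can be sketched in a few lines: mean imputation for missing values followed by min-max normalization to the range [0, 1]. Both are simple representative choices assumed for this example. The function names and the sample column are hypothetical, and missing values are represented by `None`.

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max(column):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical attribute with one missing value
ages = [20.0, None, 40.0, 60.0]
filled = impute_mean(ages)
scaled = min_max(filled)
print(filled, scaled)
```

Imputation is applied first here because the minimum and maximum used for scaling cannot be computed while gaps are present.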


Forms of Data Reduction


• Feature Selection
  – Achieves the reduction of the data set by removing irrelevant or redundant features (or dimensions).
• Instance Selection
  – Consists of choosing a subset of the total available data to achieve the original purpose of the DM application as if the whole data had been used.


• Discretization
  – Transforms quantitative data into qualitative data, that is, numerical attributes into nominal attributes with a finite number of intervals.
• Feature Extraction/Instance Generation
  – Extends both the feature and instance selection by allowing the modification of the internal values that represent each example or attribute.
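Discretization can be illustrated with equal-width binning, one of the simplest schemes and an assumption of this sketch. The function name `equal_width_bins`, the bin labels and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a nominal label for one of n_bins equal intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width > 0:
            # The maximum value falls on the upper edge, so clamp it into the last bin
            index = min(int((v - lo) / width), n_bins - 1)
        else:
            index = 0  # all values identical: a single interval
        labels.append(f"bin{index}")
    return labels


# Hypothetical numeric attribute split into two intervals: [0, 5) and [5, 10]
temperatures = [0.0, 4.0, 5.0, 9.0, 10.0]
labels = equal_width_bins(temperatures, n_bins=2)
print(labels)
```

The output is a nominal attribute with a finite number of values, which is the form required by methods such as rule learners and association rule mining.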
