data mining - overviericher/dm/data_mining_1_overview.pdf · definition 1 data mining (minería de...
TRANSCRIPT
Data Mining - Overview
Dr. Jean-Michel RICHER
Dr. Jean-Michel RICHER Data Mining - Overview 1 / 27
Outline
1. Introduction
2. What will we cover ?
3. Skills required and to acquire
4. Evaluation
Dr. Jean-Michel RICHER Data Mining - Overview 2 / 27
1. Introduction
Dr. Jean-Michel RICHER Data Mining - Overview 3 / 27
Subject
Definition 1Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful informa-tion from data.
Definition 2Data Mining is the computing process of discovering pat-terns in large data sets involving methods at the intersec-tion of machine learning, statistics, and database systems.
Dr. Jean-Michel RICHER Data Mining - Overview 4 / 27
Subject
Definition 1Data Mining (Minería de Datos) is the extraction of im-plicit, previously unknown, and potentially useful informa-tion from data.
Definition 2Data Mining is the computing process of discovering pat-terns in large data sets involving methods at the intersec-tion of machine learning, statistics, and database systems.
Dr. Jean-Michel RICHER Data Mining - Overview 4 / 27
What do they have in common ?
Common termsdiscovery, extraction (not obvious)useful information, patterns, rules (form of the usefulinformation)data, database (potentially a huge amount of data)statistics, machine learning, ... (methods)
Dr. Jean-Michel RICHER Data Mining - Overview 5 / 27
What is data mining ?
DefinitionData Mining is a branch of computer science which is ei-ther
a knowledge discovery process of hidden or nontrivial information (ex: supermarket rules)or a knowledge synthesis process used to improve theunderstanding of raw data (ex: flower classification)
Dr. Jean-Michel RICHER Data Mining - Overview 6 / 27
Data definition
More generally
given a set of Objects X = { ~X1, . . . ,~Xn} (persons,
customers, patients, plants, ...)defined by a set of properties P = {P1, . . . ,Pk} (length,size, weight, color, ...)each Pi having discrete or continuous valueswe want to find
I patterns / rules than can explain the relationshipsbetween objects and their properties
I classify objects
Dr. Jean-Michel RICHER Data Mining - Overview 7 / 27
What are properties ?
A note on propertiesProperties can be
quantitative: a numerical valuequalitative: a text
ExampleFor temperatures:
quantitative and continuous: [−20◦C,+40◦C]
qualitative and discrete: very cold, cold, ..., hot, veryhot
Dr. Jean-Michel RICHER Data Mining - Overview 8 / 27
From data to knowledge
Objects P1 P2 ... Pk
~X1 x11 x2
1 ... xk1
...~Xn x1
n x2n ... xk
n
if (x1i > 10) and (x3
i == true) then x7i = humid
x1i = 1.23× x2
i + 2.56x3i
c1 = { ~X1,~X3},c2 = { ~X2,
~X4,~X5}
Dr. Jean-Michel RICHER Data Mining - Overview 9 / 27
Example (1/3)
Truth tableConsider the following truth table in logic (0 = false, 1 =true):
X Y Z f (X ,Y , Z)0 0 0 00 0 1 00 1 0 00 1 1 11 0 0 01 0 1 11 1 0 11 1 1 1
How can you summarize the function f (X ,Y , Z) ?
Dr. Jean-Michel RICHER Data Mining - Overview 10 / 27
Example (2/3)
Boolean expressionYou can compute the boolean expression of f (X ,Y , Z) as :
f (X ,Y , Z) = X .Y .Z + X .Y .Z + X .Y .Z + X .Y .Z
Majority functionIt is the majority function, such that f (X ,Y , Z) = 1 if thenumber of variables (X ,Y , Z) set to 1 is greater than thenumber of variables set to 0.
Dr. Jean-Michel RICHER Data Mining - Overview 11 / 27
Example (2/3)
Boolean expressionYou can compute the boolean expression of f (X ,Y , Z) as :
f (X ,Y , Z) = X .Y .Z + X .Y .Z + X .Y .Z + X .Y .Z
Majority functionIt is the majority function, such that f (X ,Y , Z) = 1 if thenumber of variables (X ,Y , Z) set to 1 is greater than thenumber of variables set to 0.
Dr. Jean-Michel RICHER Data Mining - Overview 11 / 27
Example (3/3)
Majority functionWith the majority function definition, you can extend thefunction with a greater number of input variables andkeep the same definition :
f (X ,Y , Z ,W )
f (X ,Y , Z ,W , T )
...
Dr. Jean-Michel RICHER Data Mining - Overview 12 / 27
2. What will we cover ?
Dr. Jean-Michel RICHER Data Mining - Overview 13 / 27
We will cover
Theoretical and practical aspects ofregressionclassificationclusteringdecision treeneural networks
Dr. Jean-Michel RICHER Data Mining - Overview 14 / 27
We will cover
RegressionPrinciple: try tocompute a property infunction of otherpropertiesexample: weight =f (height ,age, sex)
0
5
10
15
20
0 1 2 3 4 5 6 7 8
y
x
measureslaw 2*x+3
Weka LR
Dr. Jean-Michel RICHER Data Mining - Overview 15 / 27
We will cover
0
5
10
15
20
0 1 2 3 4 5 6 7 8
y
x
measureslaw 2*x+3
Weka LR
Dr. Jean-Michel RICHER Data Mining - Overview 15 / 27
We will cover
ClassificationPrinciple: try to splitdata into groups usingthe properties of objets,knowing the number ofgroupspart of supervisedlearning
Dr. Jean-Michel RICHER Data Mining - Overview 16 / 27
We will cover
ClusteringPrinciple: try to create groups of objects based ontheir similarities, the number of groups is unknownpart of unsupervised learning
Dr. Jean-Michel RICHER Data Mining - Overview 17 / 27
We will cover
Decision treePrinciple: create amodel that predicts thevalue of a targetvariable by learningsimple decision rulesinferred from the datafeaturesa non-parametricsupervised learningmethod used forclassification andregression
Dr. Jean-Michel RICHER Data Mining - Overview 18 / 27
We will cover
Dr. Jean-Michel RICHER Data Mining - Overview 18 / 27
We will cover
Neural Networksalso called ArtificialNeural Networks (ANNs)or ConnectionistSystemsPrinciple: a computingdevice inspired bybiological neuralnetworks that has theability to progressivelylearn by consideringexamples (train set)kind of a black-box
Dr. Jean-Michel RICHER Data Mining - Overview 19 / 27
Resources
SoftwareWe will use the following software:
WEKA Waikato Environment for Knowledge Analysis,http://www.cs.waikato.ac.nz/ml/weka
R (rio, dplyr, ggplot2)Python (numpy, scipy, sklearn, matplotlib)
Dr. Jean-Michel RICHER Data Mining - Overview 20 / 27
3. Skills
Dr. Jean-Michel RICHER Data Mining - Overview 21 / 27
Required skills
Skills you need to haveBefore entering the course you should have
mathematical background (matrix, statistics,probabilities)computer science background (Java, R or Python)
Dr. Jean-Michel RICHER Data Mining - Overview 22 / 27
Skills to acquire
Skills to acquireAfter this course you shouldbe able to
give a definition and explain what is data miningdefine the different kind of analysis / analytics of DataMiningchose the correct method / algorithm to produce ananalysis in function of the data and the question youhave to answer
Dr. Jean-Michel RICHER Data Mining - Overview 23 / 27
3. End
Dr. Jean-Michel RICHER Data Mining - Overview 24 / 27
University of Angers - Faculty of Sciences
UA - Angers2 Boulevard Lavoisier 49045 AngersCedex 01Tel: (+33) (0)2-41-73-50-72
Dr. Jean-Michel RICHER Data Mining - Overview 25 / 27