Introduction to Datamining: Concepts and Techniques
Schedule:
1. Example of Datamining
2. What and Where is Datamining in the System
3. Datamining Techniques
   • Data Preprocessing
   • Data Analysis
   • Data Visualization
What does data look like?
X  Y
3  3
3  1
2  2
4  6
2  3
6  7
7  5
5  6
Can we get something from this?
Each row represents an object, and its columns represent the object's attributes.
Ex: Can we identify groups among these objects? YES.
1. Example of Datamining
Now, forget the table and consider each row as a point. Then we have:
[Figure: the rows plotted as points on X and Y axes (0 to 8); points A, B and C are labeled]
For each data point, we find its neighbors by scanning within a radius r. For example, A has 2 neighbors, B and C, denoted A{B,C}.
A and D have the same neighbors, so they are also considered neighbors.
The same applies to B{A,B,C,D}, C{A,B,C,D} and D{B,C}.
Points that are neighbors are placed in the same group.
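The neighbor-scanning idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the eight points are read from the X/Y table, the radius r = 2.5 is picked by hand, and the name group_by_radius is hypothetical. The rule is simplified so that any two points linked by a chain of neighbors end up in the same group.

```python
from math import dist

def group_by_radius(points, r):
    """Give each point a group id; points within distance r share a group."""
    labels = [None] * len(points)
    group = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already reached from an earlier point
        labels[start] = group
        stack = [start]
        while stack:
            i = stack.pop()
            # Scan around point i with radius r and pull in unlabeled neighbors.
            for j, q in enumerate(points):
                if labels[j] is None and dist(points[i], q) <= r:
                    labels[j] = group
                    stack.append(j)
        group += 1
    return labels

# Assumed reading of the X/Y table; r is a hand-picked value.
points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
labels = group_by_radius(points, r=2.5)
print(labels)
```

With a smaller radius some points would no longer reach each other and more groups would appear, so the choice of r matters.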
Finally, after considering all points, we have 2 groups.
[Figure: the same plot on X and Y axes (0 to 8), with the points divided into 2 groups]
What do we see here?
The data was not classified into groups beforehand, but now we have the groups.
This is just an example of a technique called CLUSTERING in DATAMINING.
2. What and Where is Datamining in the System
So, what exactly is Datamining?
Datamining is a set of tools and techniques for retrieving hidden knowledge/rules from data.
The name "datamining" can be misleading.
The data is already there; we do not need to "mine" it.
For ore mining you need hammers and shovels.
However, for datamining you need mathematics, statistics and probability, machine learning, computer programming, database techniques, ...
Where is Datamining in the system?
Employee/Staff
Day by day, the staff use software (web/desktop/mobile applications) to generate data by recording all of their business activities (customers, products, order details, contracts, …).
Data is added to the database.
Online transaction processing (OLTP): Database, Database, ….
Integration Service
Data from several data sources (OLTP) is collected into a common repository: the data warehouse.
Data Mining
The datamining service accesses the data warehouse to process the data.
3. Datamining Techniques
What are the techniques in Datamining?
Many techniques can be applied in datamining.
Basically, we can classify them into 3 groups/phases:
• Data Preprocessing
• Data Analysis
• Data Presentation
The quality of collected data is often poor, so it is necessary to clean / format / transform ... the data before analyzing it.
This is a very important process, and it is very hard to describe in an abstract way.
Data-Preprocessing
Here we will see a few examples of data preprocessing techniques:
• Similarity Measure
• Down Sampling
• Dimension Reduction
• Vectorization
How can we know which objects are similar?
Data-Preprocessing Similarity Measure
[Figure: points A(x1,y1), B(x2,y2) and C(x3,y3); D1 is the distance between A and B, D2 is the distance between A and C]
Measure the distances AB (D1) and AC (D2).
We see that D1 < D2, so A is more similar to B than to C.
Every point can also be represented as a vector. Measure the angle between each pair of vectors: A and B (α), then A and C (β).
We see that α < β, so A is more similar to B than to C.
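Both similarity measures can be tried out with a short Python sketch. The coordinates of A, B and C are hypothetical values chosen for illustration, and the helper name angle_between is an assumption, not something taken from the slides.

```python
from math import acos, degrees, dist, hypot

def angle_between(u, v):
    """Angle in degrees between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    cos_angle = dot / (hypot(*u) * hypot(*v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return degrees(acos(max(-1.0, min(1.0, cos_angle))))

# Hypothetical coordinates, for illustration only.
A, B, C = (2, 1), (3, 2), (1, 4)

# Distance-based similarity: a smaller distance means more similar.
d1, d2 = dist(A, B), dist(A, C)

# Angle-based similarity: a smaller angle means more similar.
alpha, beta = angle_between(A, B), angle_between(A, C)

print(f"D1 = {d1:.2f}, D2 = {d2:.2f}")
print(f"alpha = {alpha:.1f} deg, beta = {beta:.1f} deg")
```

For these points both measures agree that A is closer to B than to C. In general the two can disagree, because the angle ignores the length of the vectors.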
What if you have so much data that analyzing all of it is unnecessary and hurts performance?
Data-Preprocessing Down Sampling
Just pick some of the objects to evaluate.
Example: use a grid with cell size g and keep only one object per cell.
[Figure: the original data and the down-sampled data, on a grid with cell size g]
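A minimal sketch of grid-based down sampling, assuming that "one object per cell" means keeping the first object seen in each cell. The function name grid_downsample and the sample points are hypothetical.

```python
from math import floor

def grid_downsample(points, g):
    """Keep the first point that falls into each g-by-g grid cell."""
    kept = {}
    for p in points:
        cell = (floor(p[0] / g), floor(p[1] / g))
        kept.setdefault(cell, p)  # later points in the same cell are dropped
    return list(kept.values())

# Hypothetical data, for illustration only.
points = [(0.2, 0.3), (0.8, 0.9), (1.5, 0.4), (1.9, 0.1), (0.4, 1.7), (2.6, 2.2)]
sample = grid_downsample(points, g=1.0)
print(sample)
```

A larger g gives fewer, coarser samples. Other choices, such as keeping the point nearest the cell center, would fit the same description.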
All the example data presented so far has 2 dimensions, i.e. 2 attributes (X, Y). What if each object had ~10,000 attributes?
Data-Preprocessing Dimension Reduction
This could reduce the performance (and/or accuracy) of data analysis algorithms, so we need to reduce the number of dimensions somehow.
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two of the most widely used methods for this.
Data-Preprocessing Dimension Reduction - PCA
[Figure: PCA. The original data on axes X and Y with principal components P1 and P2, and the data projected onto the principal components]
We only keep the principal components that have the highest eigenvalues. In the above example, we can keep only P1 instead of both P1 and P2.
In this way, the number of dimensions is reduced.
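For 2-D data the whole procedure fits in a short, standard-library Python sketch. It uses the closed-form eigenvalues of the 2x2 covariance matrix, so it only covers the two-attribute case. The sample points and the name pca_2d are hypothetical.

```python
from math import hypot, sqrt

def pca_2d(points):
    """Return (eigenvalues, first principal component, 1-D projections)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid, half = (a + c) / 2, sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + half, mid - half
    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = hypot(vx, vy)
    p1 = (vx / norm, vy / norm)
    # Keep only the coordinate along P1: 2 dimensions become 1.
    projected = [dx * p1[0] + dy * p1[1] for dx, dy in centered]
    return (lam1, lam2), p1, projected

# Hypothetical data lying close to a straight line.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 5.2)]
eigenvalues, p1, projected = pca_2d(points)
ratio = eigenvalues[0] / sum(eigenvalues)
print(f"P1 = ({p1[0]:.3f}, {p1[1]:.3f}), variance kept = {ratio:.1%}")
```

The ratio of the largest eigenvalue to the sum of all eigenvalues tells how much of the variance survives the reduction. For data with many attributes, a numerical library would normally compute the eigenvectors instead.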
Data-Preprocessing Vectorization
Most data analysis algorithms take a set of vectors as input, so we need to transform the collected data into a set of vectors.
Ex: Given a document: “Mr A has not passed the exam this year. He will do it again next year”
Some important words are extracted, such as “Mr A”, “not”, “pass”, “exam”, “again”, “next”, “year”.
Measuring the frequency of each word gives the vector that represents the document:
Mr A   not   pass   exam   again   next   year
1      1     1      1      1       1      2
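The word-frequency vector can be reproduced with a small Python sketch. Two simplifications are assumed here: the list of important words is given by hand, and a word is counted whenever a token starts with it, which is a crude stand-in for stemming ("passed" counts as "pass"). The name vectorize is hypothetical.

```python
import re

def vectorize(document, vocabulary):
    """Count, for each vocabulary term, how often it starts a word in the text."""
    text = document.lower()
    return [len(re.findall(r"\b" + re.escape(term), text)) for term in vocabulary]

document = "Mr A has not passed the exam this year. He will do it again next year"
# Important words from the example, lowercased; chosen by hand.
vocabulary = ["mr a", "not", "pass", "exam", "again", "next", "year"]

vector = vectorize(document, vocabulary)
print(vector)
```

Applying the same vocabulary to every document yields vectors of equal length, which is what the analysis algorithms expect.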
There are many techniques in this phase:
• Clustering
• Classification
• Regression
• Rule Base
• ….
This is the most important phase, where we find the hidden knowledge/rules in the data.
Data Analysis
Clustering is the process of grouping objects into groups (clusters).
Data Analysis Clustering
Objects in the same cluster are similar to each other, while objects in different clusters are not.
There are 2 types of clustering: Partitional and Hierarchical.
In this presentation, we see an example of the best-known clustering method: K-Means.
Data Analysis Clustering – K-Means Algorithm
1. Randomly select K centers (centroids) for the K clusters.
2. Calculate the distance from each object to the K centroids.
3. Assign each object to the group of its nearest centroid.
4. Compute the new centroid of each group.
5. Repeat from step 2 until the groups no longer change.
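The five steps can be sketched in plain Python as follows. This is an illustration, not a tuned implementation: the optional init argument, the seed, the max_iter safety limit and the rule that an empty group keeps its old centroid are all assumptions added here, and the eight sample points are hypothetical.

```python
import random
from math import dist

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, labels) following the five steps above."""
    # Step 1: pick K starting centroids (random data points unless init is given).
    centroids = list(init) if init is not None else random.Random(seed).sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Step 5: stop once the groups no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its group.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical 2-D data, for illustration only.
points = [(3, 3), (3, 1), (2, 2), (4, 6), (2, 3), (6, 7), (7, 5), (5, 6)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Passing init=[(2, 2), (4, 6)] instead of relying on the random start makes a run reproducible by hand, which helps when checking the steps on paper.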
Consider the data below.
Plotting it, we have:
Select K=2 centroids. Compute the new positions of the centroids.
Finally, the centroids stop changing.
Each object belongs to the group of its closest centroid.
The key point of the algorithm is to select a good K.
Data Analysis Classification
How can we identify the group of an unclassified object?
Sure, we can perform clustering to do this.
However, what if we already know some objects that were classified in the past? Can we do better than clustering? YES.
We can construct a prediction model to predict the group of unclassified objects based on the classified objects.
This process is called CLASSIFICATION.
The process of classification can be described as follows:
[Figure: Learning Algorithm → Model]
Data Analysis Classification - SVM
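Support Vector Machine (SVM) is one of the best-known classification methods. It belongs to the group of linear classifiers.
For example: training data classified in red and blue.
w: normal vector of the separating line
b: bias (the line is at distance |b| / ‖w‖ from the origin)
[Figure: red and blue training data separated by a line; "?" marks an unclassified object to be labeled by the classification model]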
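Once a linear model is available, classifying a new object only needs the normal vector and the bias. The sketch below shows that decision rule, not SVM training: the values of w and b are picked by hand, the separating line is written as w·x + b = 0, and which side counts as red or blue is an assumption.

```python
from math import hypot

def classify(x, w, b):
    """Label x by the side of the line w.x + b = 0 it falls on."""
    score = w[0] * x[0] + w[1] * x[1] + b
    return "red" if score >= 0 else "blue"

# Hypothetical model: w and b are picked by hand here, not learned by an SVM.
w, b = (1.0, 1.0), -8.0

# Distance from the origin to the separating line.
distance_to_origin = abs(b) / hypot(*w)

print(classify((2, 2), w, b), classify((6, 7), w, b))
print(f"distance from origin to line: {distance_to_origin:.3f}")
```

What SVM training adds is the choice of w and b: among all separating lines, it looks for the one with the widest margin to the nearest training points.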
Data Analysis Regression
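Regression is also used for prediction, but to predict the missing value of an attribute.
For example:
[Figure: data points on X and Y axes, with x_i marked on the X axis and y_i on the Y axis]
• How do we find y_i if x_i is known?
• We can estimate the line that describes the data.
• Plug x_i into the line equation to find y_i.
• This is just an example of Linear Regression.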
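The bullets above can be sketched with ordinary least squares, one common way to estimate the line (the slides do not say which method is used). The data values and the name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of the line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(xs, ys)

# Plug a known x_i into the line equation to estimate the missing y_i.
x_i = 6
y_i = slope * x_i + intercept
print(f"y = {slope:.2f} * x + {intercept:.2f}; estimate at x = {x_i}: {y_i:.2f}")
```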
Data Analysis Rule Base
Rule Base techniques are used to find hidden patterns in the data.
Examples of rule base techniques:
• Frequent Pattern: customers who normally buy rice always buy vegetables.
• Gradual Pattern: young people want more expensive phones than others.
• Sequential Pattern: people always buy a laptop before buying a cell phone.
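For the frequent-pattern case, a rule such as "rice → vegetable" is usually judged by its support and confidence. The sketch below computes both on a handful of made-up shopping baskets; the transactions and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears among transactions with the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up shopping baskets, for illustration only.
transactions = [
    {"rice", "vegetable", "fish"},
    {"rice", "vegetable"},
    {"rice", "vegetable", "milk"},
    {"bread", "milk"},
    {"rice", "egg"},
]

sup = support(transactions, {"rice", "vegetable"})
conf = confidence(transactions, {"rice"}, {"vegetable"})
print(f"support = {sup:.2f}, confidence = {conf:.2f}")
```

A rule is typically reported only when both values exceed thresholds chosen by the analyst.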
Data Visualization
Techniques for presenting the knowledge you have retrieved to the user.
[Figure: chart of Series 1, Series 2 and Series 3 by category]
             Series 1   Series 2   Series 3
Category 1   4.3        2.4        2
Category 2   2.5        4.4        2
Category 3   3.5        1.8        3
Category 4   4.5        2.8        5