Data mining and machine learning A brief introduction

Upload: frederick-greer

Post on 27-Dec-2015


Page 1: Data mining and machine learning A brief introduction

Data mining and machine learning

A brief introduction

Page 2: Data mining and machine learning A brief introduction

Outline
• A brief introduction to learning algorithms
  • Classification algorithms
  • Clustering algorithms
• Addressing privacy issues in learning
  • Single dataset publishing
  • Distributed multiple datasets
  • How data is partitioned

Page 3: Data mining and machine learning A brief introduction

A quick review

Machine learning algorithms
• Supervised learning (classification)
  • Training data have class labels
  • Find the boundary between classes
• Unsupervised learning (clustering)
  • Training data have no labels
  • Similarity measure is the key
  • Grouping records based on the similarity measure

Page 4: Data mining and machine learning A brief introduction

A quick review

Good tutorials
• http://www.cs.utexas.edu/~mooney/cs391L/
• “Top 10 data mining algorithms”: www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf

We will review the basic ideas of some algorithms

Page 5: Data mining and machine learning A brief introduction

C4.5 decision tree (classification)
• Based on the ID3 algorithm
• Convert the decision tree to a rule set
  • Each path from the root to a leaf is one rule
  • Prune the rules
• Cross validation
  • Split the data into N folds
  • In each round, the folds are divided into training, validating, and testing parts
    • Validating: for choosing the best parameters
    • Testing: for testing the generalization power
  • Final result: the average of the N testing results
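The N-fold procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the names `n_fold_indices`, `cross_validate`, `train_majority`, and `accuracy` are hypothetical, the learner is a trivial majority-class stand-in rather than C4.5, and the separate validating part used for parameter selection is left out for brevity.

```python
import random

def n_fold_indices(num_records, n_folds, seed=0):
    """Shuffle record indices and deal them into n_folds disjoint folds."""
    indices = list(range(num_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(records, labels, n_folds, train_fn, score_fn):
    """Rotate the testing fold; return the average of the N testing scores."""
    folds = n_fold_indices(len(records), n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = train_fn([records[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [records[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / n_folds

# Hypothetical stand-in learner: always predict the majority training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)
```

Any real learner can be plugged in through `train_fn` and `score_fn`; only the fold rotation and the final averaging are the point here.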

Page 6: Data mining and machine learning A brief introduction

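Naïve Bayes (classification)
• Two classes: 0/1; feature vector: $x = (x_1, x_2, \ldots, x_n)$
• Apply Bayes rule, with $c$ the class label: $P(c \mid x) = \frac{f(x \mid c)\,P(c)}{f(x)}$
• Assume independent features: $f(x \mid c) = \prod_{i=1}^{n} f(x_i \mid c)$
• Easy to count $f(x_i \mid c)$ with the training data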
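The counting step can be illustrated with a small Python sketch for categorical features. It is only a sketch under stated assumptions: the function names are hypothetical, and the add-alpha (Laplace) smoothing with `num_values` possible values per feature is an addition that the slide does not mention, used here so that unseen feature values do not zero out the product.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[(class, feature index)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, row, alpha=1.0, num_values=2):
    """Pick the class maximizing P(class) * prod_i f(x_i | class).

    alpha (smoothing) and num_values (distinct values per feature)
    are assumptions of this sketch, not part of the slide.
    """
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            seen = value_counts[(label, i)][value]
            score *= (seen + alpha) / (count + alpha * num_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The denominator f(x) of Bayes rule is the same for every class, so the sketch compares only the numerators.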

Page 7: Data mining and machine learning A brief introduction

K nearest neighbor (classification)
• “Instance-based learning”
• Classify a point by the labels of its k nearest training points
• Decision area: Dz
• More general: kernel methods
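A minimal Python sketch of k-nearest-neighbor classification follows. It assumes Euclidean distance and a simple majority vote, neither of which the slide specifies; `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

No model is built ahead of time: all work happens at query time, which is why the method is called instance-based.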

Page 8: Data mining and machine learning A brief introduction

Linear classifier (classification)
• Separating hyperplane: $w^T x + b = 0$
• One side: $w^T x + b < 0$; the other side: $w^T x + b > 0$
• $f(x) = \mathrm{sign}(w^T x + b)$
• Examples:
  • Perceptron
  • Linear discriminant analysis (LDA)
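As an illustration of f(x) = sign(w^T x + b), here is a minimal perceptron sketch in Python. The function names, the learning rate `lr`, and the epoch limit are assumptions of this sketch; labels are taken to be +1 or -1, and an activation of exactly 0 is mapped to -1.

```python
def predict(w, b, x):
    """f(x) = sign(w^T x + b), returning +1 or -1 (0 is mapped to -1 here)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if activation > 0 else -1

def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Classic perceptron updates; labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if predict(w, b, x) != y:
                # Move the hyperplane toward the misclassified point.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b
```

On linearly separable data the loop stops at some separator, but not necessarily the best one.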

Page 9: Data mining and machine learning A brief introduction

There are infinitely many linear separators. Which one is optimal?

Page 10: Data mining and machine learning A brief introduction

Support Vector Machine (classification)
• Distance from example $x_i$ to the separator is $r = \frac{w^T x_i + b}{\lVert w \rVert}$
• Examples closest to the hyperplane are support vectors.
• Margin $\rho$ of the separator is the distance between support vectors.
• Goal: maximize the margin $\rho$
• Extended to handle:
  1. Nonlinear
  2. Noisy margin
  3. Large datasets
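The distance and margin definitions can be checked numerically with a short Python sketch. It does not train an SVM: it takes a separator (w, b) as given and assumes that separator sits midway between the two classes, so the margin is twice the smallest distance. The function names and the tolerance `tol` are hypothetical.

```python
import math

def signed_distance(w, b, x):
    """r = (w^T x + b) / ||w||: signed distance from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Return the margin of a given separator and the points closest to it."""
    distances = [abs(signed_distance(w, b, x)) for x in points]
    closest = min(distances)
    support = [x for x, d in zip(points, distances) if d - closest <= tol]
    # Assumes the separator is centered between the classes.
    return 2 * closest, support
```

For w = (1, 0) and b = 0, the points (1, 0) and (-1, 0) are the support vectors and the margin is 2.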

Page 11: Data mining and machine learning A brief introduction

Boosting (classification)
• Classifier ensembles
  • Average the predictions of a set of classifiers trained on the same set of data
  • $H(x) = \sum_i h_i(x)$
• Weight the learning examples for a new classifier $h_i(x)$ based on the previous classifiers
  • Emphasis on incorrectly predicted examples
• Intuition
  • Sample weighting
  • Averaging can reduce the variance of prediction

Page 12: Data mining and machine learning A brief introduction

AdaBoost
• Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci
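The reweighting idea can be sketched with a compact AdaBoost in Python. This is an illustrative sketch under assumptions, not the paper's pseudocode: the weak learners are one-dimensional threshold stumps, labels are +1 or -1, and all function names are hypothetical. The ensemble uses the weighted form H(x) = sign(sum_i alpha_i h_i(x)).

```python
import math

def stump_predict(threshold, polarity, x):
    """Weak classifier: `polarity` if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Find the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(threshold, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=10):
    """Train weighted stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the others.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """H(x) = sign(sum_i alpha_i * h_i(x))."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

No single stump can separate labels such as +, +, -, -, +, + along a line, yet a few reweighted stumps combined can.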

Page 13: Data mining and machine learning A brief introduction

Gradient boosting
• J. Friedman: Stochastic gradient boosting, http://citeseer.ist.psu.edu/old/126259.html

Page 14: Data mining and machine learning A brief introduction

Clustering

Definition of similarity measures
• Point-wise
  • Euclidean
  • Cosine (document similarity)
  • Correlation
  • …
• Set-wise
  • Min/max distance between two sets
  • Entropy based (categorical data)
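The point-wise and set-wise measures listed above can be written out directly. The Python sketch below uses hypothetical function names and assumes Euclidean distance as the base for the set-wise measures; the entropy-based measure is omitted.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])

def min_set_distance(set_a, set_b):
    """Set-wise distance: closest pair across the two sets."""
    return min(euclidean(p, q) for p in set_a for q in set_b)

def max_set_distance(set_a, set_b):
    """Set-wise distance: farthest pair across the two sets."""
    return max(euclidean(p, q) for p in set_a for q in set_b)
```

Euclidean is a distance (smaller means more similar), while cosine and correlation are similarities (larger means more similar), so they cannot be swapped without flipping the comparison.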

Page 15: Data mining and machine learning A brief introduction

Types of clustering algorithm
• Hierarchical
  1. Merge the most similar pair at each step
  2. Repeat until reaching the desired number of clusters
• Partitioning (k-means)
  1. Set initial centroids
  2. Partition the data
  3. Adjust the centroids
  4. Iterate on 2 and 3 until converging
• Other classification of algorithms
  • Agglomerative (bottom-up) methods
  • Divisive (partitional, top-down) methods
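The hierarchical (bottom-up) procedure can be sketched as follows. This Python sketch assumes the distance between two clusters is the minimum Euclidean distance between their points, which is only one of several possible choices; `agglomerative` is a hypothetical name.

```python
import math

def agglomerative(points, num_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the most similar pair
        del clusters[j]
    return clusters
```

Every merge step compares all pairs of clusters, which is where the quadratic (or worse) cost of this family of methods comes from.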

Page 16: Data mining and machine learning A brief introduction

Challenges in Clustering

• Efficiency of the algorithm on large datasets
  • Linear-cost algorithms: k-means
  • However, the costs of many algorithms are quadratic
  • Perform a three-phase processing
    1. Sampling
    2. Clustering
    3. Labeling
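The three phases can be sketched in Python as follows. The sketch rests on assumptions the slide does not state: the sample is uniform, the expensive algorithm run on the sample is a simple single-link agglomerative merge, and each record is labeled with the cluster holding its nearest sampled point. All names are hypothetical.

```python
import math
import random

def cluster_sample(sample, num_clusters):
    """Quadratic-cost stand-in: merge the two closest clusters repeatedly."""
    clusters = [[p] for p in sample]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

def three_phase(points, num_clusters, sample_size, seed=0):
    # 1. Sampling: draw a small uniform sample of the records.
    sample = random.Random(seed).sample(points, sample_size)
    # 2. Clustering: run the expensive algorithm on the sample only.
    clusters = cluster_sample(sample, num_clusters)
    # 3. Labeling: give every record the label of the nearest sampled cluster.
    labels = []
    for p in points:
        nearest = min(range(len(clusters)),
                      key=lambda c: min(math.dist(p, q) for q in clusters[c]))
        labels.append(nearest)
    return labels
```

The quadratic step now depends on the sample size rather than on the full dataset, and the labeling pass is linear in the number of records.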

Page 17: Data mining and machine learning A brief introduction

Challenges in Clustering

Irregularly shaped clusters and noise

Page 18: Data mining and machine learning A brief introduction

Sample clustering algorithms
• Typical ones
  • K-means
  • Expectation-Maximization (EM)
• A lot of clustering algorithms address different challenges
• Good survey: A. K. Jain et al., Data Clustering: A Review, ACM Computing Surveys, 1999

Page 19: Data mining and machine learning A brief introduction

K-means illustration
• Randomly select centroids
• Assign the cluster label of each point according to its distance to the centroids

Page 20: Data mining and machine learning A brief introduction

K-means
• Recalculate the centroids
• Recluster
• Repeat until the cluster labels do not change, or the changes of the centroids are very small
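The k-means loop described on these two slides can be sketched in Python. The sketch assumes Euclidean distance, picks the initial centroids by sampling data points, and keeps the old centroid if a cluster becomes empty; `max_iters`, `tol`, and `seed` are hypothetical parameters.

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly select centroids
    labels = None
    for _ in range(max_iters):
        # Assign each point to the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Recalculate each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        # Stop when labels no longer change or centroids barely move.
        if new_labels == labels or shift < tol:
            break
        labels = new_labels
    return new_labels, centroids
```

The result depends on the initial centroids, so in practice the procedure is often run from several random starts.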

Page 21: Data mining and machine learning A brief introduction

PPDM issues
• How data is collected
  • Single party releases data
  • Multiple parties collaboratively mine data
    • Pooling data
    • Cryptographic protocols
• How data is partitioned
  • Horizontally
  • Vertically
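The two partitioning schemes can be made concrete with a toy table. In this Python sketch, horizontal partitioning gives each party some of the records with all attributes, and vertical partitioning gives each party some of the attributes for all records; the function names and the round-robin assignment of records are assumptions for illustration.

```python
def horizontal_partition(table, num_parties):
    """Each party holds a subset of the records, with all attributes."""
    return [table[i::num_parties] for i in range(num_parties)]

def vertical_partition(table, column_groups):
    """Each party holds a subset of the attributes, for all records."""
    return [[tuple(row[c] for c in cols) for row in table]
            for cols in column_groups]
```

With a table of (id, name, amount) rows, `vertical_partition(table, [[0, 1], [2]])` leaves one party with the first two columns and the other with the last column of every record.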

Page 22: Data mining and machine learning A brief introduction

Single party
• Data perturbation
  • Rakesh00, for decision tree
  • Chen05, for many classifiers and clustering algorithms
• Anonymization
  • Top-down/bottom-up: decision tree
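One simple form of data perturbation is additive noise: each value is released with independent random noise added, so individual values are masked while aggregates such as the mean stay approximately intact. The Python sketch below illustrates that general idea only; whether it matches the cited methods in detail is an assumption, and `additive_perturbation` and its parameters are hypothetical.

```python
import random

def additive_perturbation(values, noise_std, seed=None):
    """Release value + noise, where noise ~ Normal(0, noise_std)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in values]
```

A larger `noise_std` hides individual values better but makes the released data less useful for mining, which is the basic privacy/utility trade-off.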

Page 23: Data mining and machine learning A brief introduction

Multiple parties

[Figure labels: Party 1, Party 2, …, Party n, each with its own data; a server with data; users; perturbed data; network]

• Service-based computing
  • Perturbation & anonymization
  • Papers: 89, 92, 94, 185
• Peer-to-peer computing
  • Cryptographic approaches
  • Papers: 95-99, 104, 107, 108

Page 24: Data mining and machine learning A brief introduction

How data is partitioned
• Horizontally partitioned
  • All additive (and some multiplicative) perturbation methods
  • Protocols: k-means, SVM, naïve Bayes, Bayesian network, …
• Vertically partitioned
  • All additive perturbation methods
  • Protocols: k-means, Bayesian network, …

Page 25: Data mining and machine learning A brief introduction

Challenges and opportunities
• Many modeling methods have no privacy-preserving version
• Cost of protocol-based approaches
• Limitation of column-based additive perturbation
• Complexity
• Privacy-preserving (PP) methods that can be applied to a class of data mining (DM) algorithms
  • E.g., geometric perturbation
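To see why one perturbation can serve a whole class of algorithms, consider a transformation that preserves distances: any algorithm that depends only on Euclidean distances (k-nearest neighbor, k-means, and similar) behaves the same on the transformed data. The Python sketch below assumes geometric perturbation means a rotation plus a translation in two dimensions; the slides do not define it, so this is an illustration rather than the published method, and `rotate_translate` is a hypothetical name.

```python
import math

def rotate_translate(points, angle, shift):
    """Apply the same 2-D rotation and translation to every point."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]
```

The released coordinates differ from the originals, yet every pairwise distance is unchanged up to floating-point rounding.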