introduction to artificial intelligence unsupervised learning · clustering clustering is one of...

47
Introduction to Artificial Intelligence Unsupervised Learning Janyl Jumadinova October 21, 2016

Upload: others

Post on 29-Aug-2020

9 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Introduction to Artificial IntelligenceUnsupervised Learning

Janyl JumadinovaOctober 21, 2016

Page 2: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Supervised learning vs. Unsupervised learning

I Supervised learning: discover patterns in the data that relatedata attributes with a target (class) attribute.- These patterns are then utilized to predict the values of thetarget attribute in future data instances.

I Unsupervised learning: the data has no target attribute.- We want to explore the data to find some intrinsic structuresin them.

2/29

Page 3: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Supervised learning vs. Unsupervised learning

I Supervised learning: discover patterns in the data that relatedata attributes with a target (class) attribute.- These patterns are then utilized to predict the values of thetarget attribute in future data instances.

I Unsupervised learning: the data has no target attribute.- We want to explore the data to find some intrinsic structuresin them.

2/29

Page 4: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

I Organizing data into classes such that there is:- high intra-class similarity- low inter-class similarity

I Finding the class labels and the number of classes directly fromthe data (in contrast to classification).

I More informally, finding natural groupings among objects.

3/29

Page 5: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

I Organizing data into classes such that there is:- high intra-class similarity- low inter-class similarity

I Finding the class labels and the number of classes directly fromthe data (in contrast to classification).

I More informally, finding natural groupings among objects.

3/29

Page 6: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

I Organizing data into classes such that there is:- high intra-class similarity- low inter-class similarity

I Finding the class labels and the number of classes directly fromthe data (in contrast to classification).

I More informally, finding natural groupings among objects.

3/29

Page 7: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

Clustering is one of the most utilized data mining techniques

It has a long history, and used in almost every field, e.g., medicine,psychology, botany, sociology, biology, archeology, marketing,insurance, libraries, etc.

I Ex.: : Given a collection of text documents, we want toorganize them according to their content similarities.

I Ex.: In marketing, segment customers according to theirsimilarities (to do targeted marketing).

4/29

Page 8: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

Clustering is one of the most utilized data mining techniques

It has a long history, and used in almost every field, e.g., medicine,psychology, botany, sociology, biology, archeology, marketing,insurance, libraries, etc.

I Ex.: : Given a collection of text documents, we want toorganize them according to their content similarities.

I Ex.: In marketing, segment customers according to theirsimilarities (to do targeted marketing).

4/29

Page 9: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Clustering

Clustering is one of the most utilized data mining techniques

It has a long history, and used in almost every field, e.g., medicine,psychology, botany, sociology, biology, archeology, marketing,insurance, libraries, etc.

I Ex.: : Given a collection of text documents, we want toorganize them according to their content similarities.

I Ex.: In marketing, segment customers according to theirsimilarities (to do targeted marketing).

4/29

Page 10: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is a natural grouping among these

objects?

5/29

Page 11: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is a natural grouping among these objects?

6/29

Page 12: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is Similarity?

The quality or state of being similar; likeness; resemblance; as, a similarity

of features. Webster’s Dictionary

Similarity is hard to define, but ... “We know it when we see it”

The real meaning of similarity is a philosophical question. We will take a more pragmatic

approach.

7/29

Page 13: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is Similarity?The quality or state of being similar; likeness; resemblance; as, a similarity

of features. Webster’s Dictionary

Similarity is hard to define, but ... “We know it when we see it”

The real meaning of similarity is a philosophical question. We will take a more pragmatic

approach.

7/29

Page 14: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is Similarity?The quality or state of being similar; likeness; resemblance; as, a similarity

of features. Webster’s Dictionary

Similarity is hard to define, but ... “We know it when we see it”

The real meaning of similarity is a philosophical question. We will take a more pragmatic

approach.

7/29

Page 15: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What is Similarity?The quality or state of being similar; likeness; resemblance; as, a similarity

of features. Webster’s Dictionary

Similarity is hard to define, but ... “We know it when we see it”

The real meaning of similarity is a philosophical question. We will take a more pragmatic

approach.7/29

Page 16: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Defining Distance Measures

Definition:

Let O1 and O2 be two objects from the universe of possible objects.The distance (dissimilarity) between O1 and O2 is a real numberdenoted by D(O1,O2).

8/29

Page 17: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What properties should a distance measure

have?

I D(A,B) = D(B,A) SymmetryOtherwise you could claim “Greg looks like Oliver, but Oliver looks nothing like

Greg.”

I D(A,A) = 0 Constancy of Self-SimilarityOtherwise you could claim “Greg looks more like Oliver, than Oliver does.”

I D(A,B) = 0 iff A = B Positivity (Separation)Otherwise there are objects in your world that are different, but you cannot tell

apart.

I D(A,B) ≤ D(A,C ) + D(B,C ) Triangular InequalityOtherwise you could claim “Greg is very like Bob, and Greg is very like Oliver, but

Bob is very unlike Oliver.”

9/29

Page 18: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What properties should a distance measure

have?

I D(A,B) = D(B,A) SymmetryOtherwise you could claim “Greg looks like Oliver, but Oliver looks nothing like

Greg.”

I D(A,A) = 0 Constancy of Self-SimilarityOtherwise you could claim “Greg looks more like Oliver, than Oliver does.”

I D(A,B) = 0 iff A = B Positivity (Separation)Otherwise there are objects in your world that are different, but you cannot tell

apart.

I D(A,B) ≤ D(A,C ) + D(B,C ) Triangular InequalityOtherwise you could claim “Greg is very like Bob, and Greg is very like Oliver, but

Bob is very unlike Oliver.”

9/29

Page 19: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What properties should a distance measure

have?

I D(A,B) = D(B,A) SymmetryOtherwise you could claim “Greg looks like Oliver, but Oliver looks nothing like

Greg.”

I D(A,A) = 0 Constancy of Self-SimilarityOtherwise you could claim “Greg looks more like Oliver, than Oliver does.”

I D(A,B) = 0 iff A = B Positivity (Separation)Otherwise there are objects in your world that are different, but you cannot tell

apart.

I D(A,B) ≤ D(A,C ) + D(B,C ) Triangular InequalityOtherwise you could claim “Greg is very like Bob, and Greg is very like Oliver, but

Bob is very unlike Oliver.”

9/29

Page 20: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

What properties should a distance measure

have?

I D(A,B) = D(B,A) SymmetryOtherwise you could claim “Greg looks like Oliver, but Oliver looks nothing like

Greg.”

I D(A,A) = 0 Constancy of Self-SimilarityOtherwise you could claim “Greg looks more like Oliver, than Oliver does.”

I D(A,B) = 0 iff A = B Positivity (Separation)Otherwise there are objects in your world that are different, but you cannot tell

apart.

I D(A,B) ≤ D(A,C ) + D(B,C ) Triangular InequalityOtherwise you could claim “Greg is very like Bob, and Greg is very like Oliver, but

Bob is very unlike Oliver.” 9/29

Page 21: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How do we measure similarity?

10/29

Page 22: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How do we measure similarity?To measure the similarity between two objects, transform one of theobjects into the other, and measure how much effort it took. Themeasure of effort becomes the distance measure.

11/29

Page 23: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How do we measure similarity?

12/29

Page 24: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Partitional Clustering

I Non-hierarchical, each instance is placed in exactly one of Knonoverlapping clusters.

I Since only one set of clusters is output, the user normally has toinput the desired number of clusters K.

13/29

Page 25: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Minimize Squared Error

14/29

Page 26: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means clustering

I K-means is a partitional clustering algorithm.I The k-means algorithm partitions the given data into k clusters.

I Each cluster has a cluster center, called centroid.I k is specified by the user.

15/29

Page 27: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means Algorithm

1. Decide on a value for k .

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigningthem to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the membershipsfound above are correct.

5. If none of the N objects changed membership in the lastiteration, exit. Otherwise goto 3.

16/29

Page 28: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means Algorithm

1. Decide on a value for k .

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigningthem to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the membershipsfound above are correct.

5. If none of the N objects changed membership in the lastiteration, exit. Otherwise goto 3.

16/29

Page 29: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means Algorithm

1. Decide on a value for k .

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigningthem to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the membershipsfound above are correct.

5. If none of the N objects changed membership in the lastiteration, exit. Otherwise goto 3.

16/29

Page 30: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means Algorithm

1. Decide on a value for k .

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigningthem to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the membershipsfound above are correct.

5. If none of the N objects changed membership in the lastiteration, exit. Otherwise goto 3.

16/29

Page 31: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-means Algorithm

1. Decide on a value for k .

2. Initialize the k cluster centers (randomly, if necessary).

3. Decide the class memberships of the N objects by assigningthem to the nearest cluster center.

4. Re-estimate the k cluster centers, by assuming the membershipsfound above are correct.

5. If none of the N objects changed membership in the lastiteration, exit. Otherwise goto 3.

16/29

Page 32: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Clustering: Step 1

17/29

Page 33: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Clustering: Step 2

18/29

Page 34: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Clustering: Step 3

19/29

Page 35: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Clustering: Step 4

20/29

Page 36: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Clustering: Step 5

21/29

Page 37: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How can we tell the right number of clusters?

I In general, this is an unsolved problem.

I We can use approximation methods!

22/29

Page 38: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How can we tell the right number of clusters?

I In general, this is an unsolved problem.

I We can use approximation methods!

22/29

Page 39: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

How can we tell the right number of clusters?

I In general, this is an unsolved problem.

I We can use approximation methods!

22/29

Page 40: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

23/29

Page 41: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

24/29

Page 42: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

25/29

Page 43: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

We can plot the objective function values for k = 1...6

I The abrupt change at k = 2, is highly suggestive of two clustersin the data.

I This technique for determining the number of clusters is knownas “knee finding” or “elbow finding”.

26/29

Page 44: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Strengths of K-Means

I Simple: easy to understand and to implement

I Efficient: Time complexity O(tkn), where n is the number ofdata points, k is the number of clusters, and t is the number ofiterations.- Since both k and t are small, k-means is considered a linearalgorithm.

I Often terminates at a local optimum.- The global optimum may be found using techniques such as:deterministic annealing and genetic algorithms

27/29

Page 45: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Weaknesses of K-Means

I The algorithm is only applicable if the mean is defined.- For categorical data - the centroid is represented by mostfrequent values.- The user needs to specify k.

I The algorithm is sensitive to outliers.- Outliers are data points that are very far away from other datapoints.- Outliers could be errors in the data recording or some specialdata points with very different values.

28/29

Page 46: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

Weaknesses of K-Means

I The algorithm is only applicable if the mean is defined.- For categorical data - the centroid is represented by mostfrequent values.- The user needs to specify k.

I The algorithm is sensitive to outliers.- Outliers are data points that are very far away from other datapoints.- Outliers could be errors in the data recording or some specialdata points with very different values.

28/29

Page 47: Introduction to Artificial Intelligence Unsupervised Learning · Clustering Clustering is one of the most utilized data mining techniques It has a long history, and used in almost

K-Means Summary

I Despite weaknesses, k-means is still the most popular algorithmdue to its simplicity, efficiency and other clustering algorithmshave their own lists of weaknesses.

I No clear evidence that any other clustering algorithm performsbetter in general, although they may be more suitable for somespecific types of data or applications.

I Comparing different clustering algorithms is a difficult task. Noone knows the correct clusters!

29/29