
CS5604 (Midterm Presentation) – October 13, 2010

Virginia Polytechnic Institute and State University

Presented by: Team 4 (Sarosh, Sony, Sherif)

CLUTO

OUTLINE

What is clustering?
What is CLUTO?
History of CLUTO
CLUTO Schematic Diagram
Application areas of CLUTO
Features of CLUTO
Relation to IR Concepts
Input file formats in CLUTO
Output file formats in CLUTO
Concept Map
Demo
Resources
Other Features
Questions

What is Clustering?

Clustering algorithms group a set of documents into subsets or clusters.

Documents within a cluster should be as similar as possible.

Documents in one cluster should be as dissimilar as possible from documents in other clusters.

Clustering can be classified into:
◦ Flat Clustering and Hierarchical Clustering
◦ Hard Clustering and Soft Clustering
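The hard/soft distinction can be made concrete with a small sketch. In a hard clustering each document belongs to exactly one cluster; in a soft clustering each document carries a membership weight for every cluster. The toy data and the helper name `harden` below are hypothetical and only illustrate the two representations; they are not part of CLUTO.

```python
# Hypothetical toy data: four documents, two clusters.

# Hard clustering: exactly one cluster id per document.
hard_assignment = [0, 0, 1, 1]

# Soft clustering: one row per document, one membership weight per cluster.
soft_assignment = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.5, 0.5],
]


def harden(memberships):
    """Turn a soft clustering into a hard one by picking the heaviest cluster.

    Ties go to the lowest cluster id, because max() keeps the first maximum.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]
```

Collapsing a soft solution this way discards information: the last document above is equally close to both clusters, yet the hard assignment has to pick one.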

What is CLUTO?

CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.

CLUTO provides three different classes of clustering algorithms that operate either directly in the object’s feature space or in the object’s similarity space.

These algorithms are based on the partitional, agglomerative, and graph-partitioning paradigms.

CLUTO provides a total of seven different criterion functions.

CLUTO provides tools for analyzing the discovered clusters to understand the relations between the objects assigned to each cluster and the relations between the different clusters.

History of CLUTO

CLUTO was developed at the Karypis Lab, University of Minnesota, Twin Cities.

Ver 1.5 - Added agglomerative clustering algorithms, cluster visualization capability, and dense input file support.

Ver 2.0 - Added new clustering programs called scluster and vcluster, and graph-partitioning-based clustering algorithms.

Ver 2.1 - Added an agglomerative algorithm that uses partitional clustering to bias the agglomeration.

Ver 2.1.1 - Reduced the memory requirements of the rb-based clustering methods.

Ver 2.1.2 - Added experimental support for multi-core processors and SMPs using OpenMP for MS Windows and Linux-i686.

Ver 2.1.2a - Included a build for Windows x86_64.

CLUTO Schematic Diagram

[Figure: schematic diagram of CLUTO. Labels in the diagram: scluster, vcluster; Clustering Algorithms, Similarity Function, Criterion Function; Graph file, Matrix file; Cluster solution file, Tree file; Row label file, Row class label file, Column label file.]

Application areas of CLUTO

Digital Libraries - To cluster documents (objects) based on the terms (dimensions) they contain.

Customer Services - Amazon.com may group customers (objects) based on the types of products (dimensions, e.g. books or music products) they purchase.

Genetics - To cluster genes (objects) based on their expression levels (dimensions).

Biochemistry - To cluster proteins (objects) based on the motifs (dimensions) they contain.
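To illustrate the objects-by-dimensions view for the digital library case, the sketch below turns a few documents into a matrix with one row per document and one column per term. It is a minimal bag-of-words illustration under simplifying assumptions (lowercasing and whitespace tokenization only); the function name `term_document_matrix` is hypothetical and is not part of CLUTO.

```python
from collections import Counter


def term_document_matrix(documents):
    """Build a matrix with one row per document and one column per term.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    vocabulary = sorted({term for doc in documents for term in doc.lower().split()})
    column = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocabulary)
        for term, count in counts.items():
            row[column[term]] = count
        rows.append(row)
    return vocabulary, rows
```

The same shape applies to the other areas listed: swap documents for customers, genes, or proteins, and terms for products, expression levels, or motifs.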

Features of CLUTO

Multiple classes of clustering algorithms: partitional, agglomerative, and graph-partitioning based.

Multiple similarity/distance functions: Euclidean distance, cosine, correlation coefficient, extended Jaccard, user-defined.

Numerous novel clustering criterion functions and agglomerative merging schemes.

Traditional agglomerative merging schemes: single-link, complete-link, UPGMA.
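The similarity and distance functions named above can be sketched with their usual textbook definitions. This is an illustration only, assuming non-zero, equal-length vectors; it is not CLUTO's implementation, and the function names are hypothetical.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    # Assumes neither vector is all zeros (otherwise this divides by zero).
    return _dot(x, y) / math.sqrt(_dot(x, x) * _dot(y, y))


def correlation_coefficient(x, y):
    # Pearson correlation: the cosine of the mean-centered vectors.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


def extended_jaccard(x, y):
    # Extended (Tanimoto) Jaccard: x.y / (|x|^2 + |y|^2 - x.y)
    dot = _dot(x, y)
    return dot / (_dot(x, x) + _dot(y, y) - dot)
```

Cosine and correlation ignore vector length, so a document and a copy of it repeated twice score 1.0, while extended Jaccard and Euclidean distance do not treat the two as identical.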

Extensive cluster visualization capabilities and output options: PostScript, SVG, GIF, xfig, etc.

Multiple methods for effectively summarizing the clusters: most descriptive and discriminating dimensions, cliques, and frequent item sets.

Can scale to very large datasets containing hundreds of thousands of objects and tens of thousands of dimensions.

CLUTO provides access to its various clustering and analysis algorithms via the vcluster and scluster stand-alone programs.

vcluster takes as input the actual multi-dimensional representation of the objects that need to be clustered.

scluster takes as input the similarity matrix (or graph) between these objects.

Their overall calling sequence is as follows:
◦ vcluster [optional parameters] MatrixFile NClusters
◦ scluster [optional parameters] GraphFile NClusters
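The calling sequence above can be assembled programmatically. The sketch below only builds the argument list in the order shown on the slide; the helper name `build_command` is hypothetical, and any optional parameters are passed through verbatim rather than named here, since their spelling should be taken from the CLUTO manual.

```python
def build_command(program, input_file, n_clusters, optional_parameters=()):
    """Return an argument list: program [optional parameters] InputFile NClusters."""
    if program not in ("vcluster", "scluster"):
        raise ValueError("program must be 'vcluster' or 'scluster'")
    if n_clusters < 1:
        raise ValueError("n_clusters must be a positive integer")
    return [program, *optional_parameters, input_file, str(n_clusters)]
```

Assuming the CLUTO binaries are installed and on the PATH, the resulting list could be handed to `subprocess.run`; the file name `docs.mat` in `build_command("vcluster", "docs.mat", 10)` is only a placeholder.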

Relation to IR Concepts

Chapter      Clustering concept
Chapter 3    Jaccard coefficient
Chapter 6    Cosine similarity measure, Euclidean distance
Chapter 7    Cluster pruning
Chapter 16   Flat clustering
Chapter 17   Hierarchical clustering

Input file formats in CLUTO

Matrix Format:

This is the primary input for CLUTO’s vcluster program.

Each row of the matrix represents a single object.

Columns correspond to the dimensions (i.e., features) of the objects.

The matrix format can be sparse or dense.

Dense Matrix Format:

The first line of the matrix file contains exactly two numbers, both of which are integers. The first integer is the number of rows in the matrix (n) and the second integer is the number of columns in the matrix (m).

Each of the remaining n lines contains exactly m space-separated floating point values, such that the ith value corresponds to the ith column of the matrix A.
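As a minimal sketch of the dense layout just described, the helper below renders a list of rows as text: a header line with n and m, then one line of m values per row. The function name `format_dense_matrix` is hypothetical and the sketch follows only what the slide states.

```python
def format_dense_matrix(rows):
    """Render rows as text: 'n m' on the first line, then m values per row."""
    n = len(rows)
    m = len(rows[0]) if rows else 0
    if any(len(row) != m for row in rows):
        raise ValueError("every row must have the same number of columns")
    lines = [f"{n} {m}"]
    for row in rows:
        lines.append(" ".join(str(float(value)) for value in row))
    return "\n".join(lines) + "\n"
```

For example, `format_dense_matrix([[1, 0, 2.5], [0, 3, 0]])` yields a header line `2 3` followed by two lines of three values each.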

[Figure: example matrix file whose first line is annotated with "Number of rows" and "Number of columns".]

Sparse Matrix Format:

In CLUTO’s sparse matrix format only the non-zero entries of the matrix are stored. The first line contains information about the size of the matrix, while the remaining n lines contain information for each row of the matrix A.

The first line of the matrix file contains exactly three numbers, all of which are integers.

The first integer is the number of rows in the matrix (n), the second integer is the number of columns in the matrix (m), and the third integer is the total number of non-zero entries in the n × m matrix.
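A sketch of a writer for the sparse layout follows. The header line (n, m, number of non-zero entries) comes from the slide. The per-row layout is an assumption: the slide does not spell it out, so the sketch writes each row as space-separated pairs of a 1-based column number followed by the value, by analogy with the sparse graph format. The function name `format_sparse_matrix` is hypothetical; check the row layout against the CLUTO manual before relying on it.

```python
def format_sparse_matrix(rows):
    """Render a matrix as text, storing only its non-zero entries."""
    n = len(rows)
    m = len(rows[0]) if rows else 0
    if any(len(row) != m for row in rows):
        raise ValueError("every row must have the same number of columns")
    nonzeros = sum(1 for row in rows for value in row if value != 0)
    lines = [f"{n} {m} {nonzeros}"]  # header as described on the slide
    for row in rows:
        # Assumed row layout: "column value" pairs, columns numbered from 1.
        pairs = [f"{j} {float(value)}" for j, value in enumerate(row, start=1) if value != 0]
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"
```

A row with no non-zero entries becomes an empty line, which keeps line i + 1 aligned with row i.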

Graph Files:

This is the primary input for CLUTO’s scluster program. It is a square matrix.

It specifies the similarity between the objects to be clustered.

A value at the (i, j) location of this matrix indicates the similarity between the ith and the jth object.

Sparse Graph Format:

The first line of the file contains exactly two numbers, both of which are integers. The first integer is the number of vertices in the graph (n) and the second integer is the number of edges in the graph.

The (i + 1)st line of the file contains information about the adjacency structure of the ith vertex.

The adjacency structure of each vertex is specified as a space-separated list of pairs. Each pair contains the number of the adjacent vertex followed by the similarity of the corresponding edge.
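The sparse graph layout can be read back with a short parser. The sketch below follows the description above: a header with the vertex and edge counts, then one line of (adjacent vertex, similarity) pairs per vertex. The function name `parse_sparse_graph` is hypothetical, and the sketch returns the edge count exactly as written in the header without checking how edges are counted, since the slide does not state that convention.

```python
def parse_sparse_graph(text):
    """Parse sparse graph text into (n_vertices, n_edges, adjacency lists)."""
    lines = text.splitlines()
    n_vertices, n_edges = (int(token) for token in lines[0].split())
    adjacency = []
    for line in lines[1:n_vertices + 1]:
        tokens = line.split()
        if len(tokens) % 2 != 0:
            raise ValueError("each vertex line must hold (vertex, similarity) pairs")
        pairs = [(int(tokens[k]), float(tokens[k + 1])) for k in range(0, len(tokens), 2)]
        adjacency.append(pairs)
    if len(adjacency) != n_vertices:
        raise ValueError("expected one adjacency line per vertex")
    return n_vertices, n_edges, adjacency
```

The returned list is indexed from zero, so `adjacency[i - 1]` holds the neighbours listed on line i + 1 of the file, that is, those of the ith vertex.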

[Figure: example graph file whose first line is annotated with "Number of vertices" and "Number of edges".]

Dense Graph Format:

The first line of the file contains exactly one number, which is the number of vertices n of the graph.

Each of the remaining n lines contains exactly n space-separated floating point values, such that the ith value on a line is that vertex's similarity to the ith vertex of the graph.

Output file format in CLUTO

Clustering Solution File
◦ The clustering file of a matrix with n rows consists of n lines with a single number per line. The ith line of the file contains the cluster number that the ith object/row/vertex belongs to. Cluster numbers run from zero to the number of clusters minus one.

Tree File
◦ The tree produced by performing a hierarchical agglomerative clustering on top of the k-way clustering solution produced by vcluster is stored in a file in the form of a parent array.
◦ The ith line contains the parent of the ith node of the tree.
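Both output files are one-number-per-line arrays, so they are easy to post-process. The sketch below groups objects by cluster from a clustering solution file and inverts a parent array into child lists. The function names are hypothetical, indices are zero-based (line i maps to index i - 1), and treating a negative parent as "no parent" (the root) is an assumption, since the slide does not say how the root is marked.

```python
from collections import defaultdict


def read_clustering(text):
    """Map each cluster number to the zero-based indices of its objects."""
    labels = [int(token) for token in text.split()]
    clusters = defaultdict(list)
    for obj, label in enumerate(labels):
        clusters[label].append(obj)
    return dict(clusters)


def children_from_parents(parents):
    """Invert a parent array into a mapping from parent node to child nodes.

    Assumption: a negative parent value marks a node without a parent (the root).
    """
    children = defaultdict(list)
    for node, parent in enumerate(parents):
        if parent >= 0:
            children[parent].append(node)
    return dict(children)
```

For instance, a four-line solution file containing 0, 1, 0, 2 puts the first and third objects together in cluster 0.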

Concept Map

Demo


Using WinSCP

Resources

S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of 1998 ACM-SIGMOD Int. Conf. on Management of Data, 1998.

G. Karypis, E.H. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68–75, 1999.

G. Karypis and V. Kumar. hMETIS 1.5: A hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota, 1998. Available on the WWW at URL http://www.cs.umn.edu/~metis.

G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. Technical report, Department of Computer Science, University of Minnesota, 1998. Available on the WWW at URL http://www.cs.umn.edu/~metis.

Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In CIKM, 2002.

Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01–40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available on the WWW at http://cs.umn.edu/~karypis/publications.

Other Features

gCLUTO
◦ gCLUTO is a cross-platform graphical application for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters.
◦ gCLUTO provides tools for visualizing the resulting clustering solutions using tree, matrix, and OpenGL-based mountain visualizations.

wCLUTO
◦ wCLUTO is a web-enabled data clustering application that is designed for the clustering and data-analysis requirements of gene-expression analysis.
◦ Users can upload their datasets, select from a number of clustering methods, perform the analysis on the server, and visualize the final results.
◦ The wCLUTO web-server is hosted by the Center of Computational Genomics and Bioinformatics at the University of Minnesota.

Questions?