-
7/29/2019 document clustering based on topic maps using k-modes algorithm
1/15
DOCUMENT CLUSTERING BASED ON TOPIC MAPS USING
K-MODES ALGORITHM
Project Members:
M. Surya Pavani - 09241A0547
D. Anusha - 09241A0504
E. Divya Sree - 10241A0506
K. Padma Sri - 10241A0508
Project Guide: Dr. K. Anuradha (Ph.D.), Head of Department, CSE
ABSTRACT
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW).
Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.
The k-modes algorithm uses a similarity measure to calculate the similarity between documents and a frequency-based method to update modes during clustering, so as to minimize the clustering cost function.
REQUIREMENTS
Hardware requirements:
Memory: 256 MB
Disk space: 200 MB
Operating system: Any OS supporting Java
Software requirements:
Java JRE 6 or later
Eclipse IDE
Wandora software
WHAT IS DATA MINING?
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Major Components of Data Mining:
Association Rule Mining
Classification and Prediction
CLUSTERING DEFINITION
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance, Manhattan Distance, Jaccard
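
The three measures named above can be sketched in a few lines. This is only an illustration in Python; the function names are hypothetical and not taken from the project. The first two are distances over numeric vectors (smaller means more similar), while Jaccard is a similarity over sets (larger means more similar).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)
```

For document clustering, Jaccard is a natural fit when each document is represented as a set of topic names rather than as a numeric vector.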
DOCUMENT CLUSTERING
Data clustering is an unsupervised technique for discovering valuable knowledge from data.
Document clustering is a specialized data clustering problem in which the objects are documents.
The objective of the clustering process is to place documents that are similar in some sense (for example, type of document or contents of document) into a single group (cluster).
Exploring, analyzing, and correctly classifying the unknown nature of the data in a document without supervision is the major requirement of a document clustering method.
STEPS IN TOPIC MAPS BASED
DOCUMENT CLUSTERING APPROACH
THREE BASIC STEPS INVOLVED
IN DOCUMENT CLUSTERING
STEP 1:
First, transform the documents into a compact form that represents only the topics present in the documents.
The topic map information is generated using an online open source application, Wandora.
The topic maps were then exported into XTM format using Wandora's export utility.
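
As a rough illustration of what the compact form might look like, the sketch below reads an XTM export and collects the topic base names into a set. It assumes the export follows XTM 1.0, where names appear in `baseNameString` elements; other XTM versions use different element names, and the sample document and the function names here are hypothetical rather than actual Wandora output.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def base_names_from_xtm(xtm_text):
    """Collect the text of every baseNameString element (XTM 1.0 assumed)."""
    root = ET.fromstring(xtm_text)
    names = set()
    for elem in root.iter():
        if local_name(elem.tag) == "baseNameString" and elem.text:
            names.add(elem.text.strip())
    return names

# Hypothetical, hand-written sample standing in for a real export.
SAMPLE_XTM = """<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/">
  <topic id="t1">
    <baseName><baseNameString>Clustering</baseNameString></baseName>
  </topic>
  <topic id="t2">
    <baseName><baseNameString>Data Mining</baseNameString></baseName>
  </topic>
</topicMap>"""

sample_names = base_names_from_xtm(SAMPLE_XTM)
```

Matching on the local element name keeps the sketch independent of the exact namespace URI used in the export.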
SCREEN SHOT OF TOPIC MAPS
STEP 2:
A base name similarity matrix is developed using Wandora's export utility.
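
The slides do not say how the matrix entries are computed. One plausible reading, shown here purely as an assumption, is a pairwise comparison of the base-name sets of the documents using the Jaccard coefficient. The document identifiers and base names below are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity_matrix(doc_names):
    """Return (ordered document ids, square matrix of pairwise similarities)."""
    docs = sorted(doc_names)
    matrix = [[jaccard(doc_names[r], doc_names[c]) for c in docs] for r in docs]
    return docs, matrix

# Hypothetical base-name sets, one per document.
example_docs = {
    "d1": {"clustering", "k-modes"},
    "d2": {"clustering", "topic maps"},
    "d3": {"java"},
}
doc_ids, sim = similarity_matrix(example_docs)
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle actually needs to be computed for large collections.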
STEP 3:
Finally, k-modes clustering is applied to obtain the final clusters.
K-MODES CLUSTERING ALGORITHM
The k-modes algorithm uses a simple matching similarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimize the clustering cost function.
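
The two building blocks can be sketched as follows, assuming each object is a tuple of categorical attribute values. Simple matching is written here in its dissimilarity form (a count of mismatching attributes), and the frequency-based mode takes the most frequent value of each attribute. The function names are hypothetical.

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attribute positions at which two objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def frequency_mode(objects):
    """Per attribute, pick the most frequent value among the objects.

    On ties, the value seen first is kept (Counter.most_common behaviour).
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))

def clustering_cost(objects, labels, modes):
    """Total dissimilarity between each object and the mode of its cluster."""
    return sum(matching_dissimilarity(o, modes[l]) for o, l in zip(objects, labels))
```

Choosing the per-attribute most frequent value as the mode is what keeps the cost from increasing when modes are updated.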
ALGORITHM
Step 1: Select K initial modes, one for each cluster.
Step 2: Allocate each object to the cluster whose mode is nearest to it according to the similarity measure.
Step 3: After all objects have been allocated to clusters, retest the similarity of objects against the current modes. If an object is found whose nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster.
Step 4: After reallocating objects, recalculate the mode of each cluster using the frequency-based method.
Step 5: Repeat steps 3 and 4 until there is no change in the modes.
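
The steps above can be sketched as a small batch loop. This is a minimal illustration, not the project's implementation: it assumes objects are tuples of categorical values, takes the first K objects as the initial modes (the slides do not specify the selection), and recomputes all modes once per pass instead of after each single reallocation. The sample data is invented.

```python
from collections import Counter

def dissimilarity(x, y):
    """Simple matching: count of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def compute_mode(members):
    """Frequency-based mode: most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def k_modes(objects, k, max_iter=100):
    # Step 1: pick K initial modes (assumption: the first K objects).
    modes = [objects[i] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 3: (re)allocate every object to its nearest mode.
        labels = [
            min(range(k), key=lambda c: dissimilarity(obj, modes[c]))
            for obj in objects
        ]
        # Step 4: recompute each mode; an empty cluster keeps its old mode.
        new_modes = []
        for c in range(k):
            members = [o for o, l in zip(objects, labels) if l == c]
            new_modes.append(compute_mode(members) if members else modes[c])
        # Step 5: stop once the modes no longer change.
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes

# Invented sample: six objects described by three categorical attributes.
sample_objects = [
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
]
sample_labels, sample_modes = k_modes(sample_objects, k=2)
```

Because the result depends on the initial modes, a practical run would typically try several initializations and keep the clustering with the lowest total cost.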
REFERENCES
Abstract: International Journal of Computer Applications (0975-8887), Volume 12, No. 1, December 2010
Data Mining Concepts: Micheline Kamber
THANK YOU