document clustering based on topic maps using k-modes algorithm

Upload: pavani-manthena

Post on 04-Apr-2018


  • 7/29/2019 document clustering based on topic maps using k-modes algorithm


    DOCUMENT CLUSTERING BASED ON TOPIC MAPS USING K-MODES ALGORITHM

    Project Members:

    M. Surya Pavani - 09241A0547

    D. Anusha - 09241A0504

    E. Divya Sree - 10241A0506

    K. Padma Sri - 10241A0508

    Project Guide: Dr. K. Anuradha (Ph.D.), Head of Department, CSE


    ABSTRACT

    Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.

    The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW).

    Topic Maps is a standard for the representation and interchange of knowledge, with an emphasis on the findability of information.

    The k-modes algorithm uses similarity measures to calculate the similarity between documents and uses a frequency-based method to update modes during clustering so as to minimize the clustering cost function.


    REQUIREMENTS

    Hardware requirements:

    Memory: 256 MB

    Disk space: 200 MB

    Operating system: any OS supporting Java

    Software requirements:

    Java JRE 6 or later

    Eclipse IDE

    Wandora software


    WHAT IS DATA MINING?

    The non-trivial extraction of implicit, previously unknown, and potentially useful information from data.

    Major components of data mining:

    Association rule mining

    Classification and prediction


    CLUSTERING DEFINITION

    Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

    Data points in one cluster are more similar to one another.

    Data points in separate clusters are less similar to one another.

    Similarity measures:

    Euclidean distance, Manhattan distance, Jaccard
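
    The three measures named above can be illustrated with a minimal Python sketch. The function names are chosen here for illustration and are not taken from the project; Euclidean and Manhattan are distances over numeric vectors (smaller means more similar), while Jaccard is shown as a similarity over sets (larger means more similar).

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(a | b)
```

    For documents, the Jaccard form is the most natural of the three when each document is reduced to a set of topic names.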


    DOCUMENT CLUSTERING

    Data clustering is an unsupervised technique for discovering valuable knowledge from data.

    Document clustering is a specialized data clustering problem in which the objects are documents.

    The objective of the clustering process is to place documents that are similar in some sense (for example, type of document or contents of document) into a single group (cluster).

    The major requirement of a document clustering method is to explore, analyze, and correctly classify the unknown nature of the data in a document without supervision.


    STEPS IN TOPIC MAPS BASED DOCUMENT CLUSTERING APPROACH


    THREE BASIC STEPS INVOLVED IN DOCUMENT CLUSTERING

    STEP 1:

    First, transform the documents into a compact form that represents only the topics present in the documents.

    The topic map information is generated using Wandora, an online open source application.

    The topic maps were then exported into XTM format using Wandora's export utility.
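
    As a rough illustration of what the compact form might look like once an XTM file is available, the Python sketch below collects topic base names from an XTM document. The sample fragment is hypothetical and assumes XTM 1.0-style `baseNameString` elements; a real Wandora export is larger and may use a different XTM version, so the element name is an assumption to check against the actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical XTM 1.0-style fragment, written by hand for this sketch.
SAMPLE_XTM = """<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/">
  <topic id="t1"><baseName><baseNameString>Data mining</baseNameString></baseName></topic>
  <topic id="t2"><baseName><baseNameString>Clustering</baseNameString></baseName></topic>
</topicMap>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def base_names(xtm_text):
    """Return the set of base name strings found anywhere in the document."""
    root = ET.fromstring(xtm_text)
    names = set()
    for elem in root.iter():
        if local_name(elem.tag) == "baseNameString" and elem.text:
            names.add(elem.text.strip())
    return names
```

    Calling `base_names(SAMPLE_XTM)` yields the two topic names in the fragment; each document would then be represented by its own set of names.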


    SCREEN SHOT OF TOPIC MAPS


    STEP 2:

    A base name similarity matrix is developed using Wandora's export utility.
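
    The slides do not say how Wandora computes this matrix, so the Python sketch below is only one plausible reading: each document is assumed to be represented by its set of topic base names, and each matrix entry is the Jaccard similarity of two such sets. The document identifiers and the `similarity_matrix` helper are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets, by assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity_matrix(docs):
    """docs maps a document id to its set of base names.

    Returns the sorted ids and a square matrix whose entry [i][j] is the
    similarity between document i and document j.
    """
    ids = sorted(docs)
    matrix = [[jaccard(docs[r], docs[c]) for c in ids] for r in ids]
    return ids, matrix

# Hypothetical input: three documents and the topics found in each.
docs = {
    "d1": {"clustering", "data mining"},
    "d2": {"clustering", "topic maps"},
    "d3": {"java"},
}
ids, matrix = similarity_matrix(docs)
```

    The matrix is symmetric with ones on the diagonal, so only the upper triangle needs to be stored for larger collections.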


    STEP 3:

    Finally, k-modes clustering is applied to obtain the final clusters.

    K-MODES CLUSTERING ALGORITHM

    The k-modes algorithm uses a simple matching similarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimize the clustering cost function.
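
    The two ingredients named above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: objects are assumed to be equal-length tuples of categorical values, the simple matching measure is written in its dissimilarity form (a count of mismatching attributes), and the function names are chosen here.

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Number of attribute positions at which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def frequency_mode(objects):
    """Frequency-based mode: per attribute, the most frequent category."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))

def clustering_cost(clusters):
    """Sum of dissimilarities between every object and its cluster's mode."""
    return sum(
        matching_dissimilarity(obj, frequency_mode(members))
        for members in clusters
        for obj in members
    )

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = frequency_mode(cluster)
```

    Here `mode` combines the most common value of each attribute, and it need not coincide with any single object in the cluster.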


    ALGORITHM

    Step 1: Select K initial modes, one for each cluster.

    Step 2: Allocate each object to the cluster whose mode is nearest to it according to the similarity measure.

    Step 3: After all objects have been allocated to clusters, retest the similarity of the objects against the current modes. If an object's nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster.

    Step 4: After reallocating the objects, recalculate the mode of each cluster using the frequency-based method.

    Step 5: Repeat steps 3 and 4 until the modes no longer change.
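
    The steps above can be sketched in Python as follows. This is a simplified batch variant under stated assumptions, not the project's implementation: objects are tuples of categorical values, the initial modes are supplied by the caller, modes are recomputed once after the first allocation so that the retest in Step 3 has something to compare against, an empty cluster keeps its previous mode, and `max_iter` is a safety limit added here.

```python
from collections import Counter

def dissimilarity(x, y):
    """Simple matching: count of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def nearest(obj, modes):
    """Index of the mode closest to obj (lowest index wins ties)."""
    return min(range(len(modes)), key=lambda k: dissimilarity(obj, modes[k]))

def update_modes(objects, labels, modes):
    """Recompute each cluster's mode from attribute value frequencies."""
    new_modes = []
    for k, old in enumerate(modes):
        members = [o for o, lab in zip(objects, labels) if lab == k]
        if not members:
            new_modes.append(old)  # assumption: empty cluster keeps its mode
            continue
        new_modes.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )
    return new_modes

def k_modes(objects, initial_modes, max_iter=100):
    modes = [tuple(m) for m in initial_modes]             # Step 1
    labels = [nearest(o, modes) for o in objects]         # Step 2
    modes = update_modes(objects, labels, modes)
    for _ in range(max_iter):
        labels = [nearest(o, modes) for o in objects]     # Step 3
        new_modes = update_modes(objects, labels, modes)  # Step 4
        if new_modes == modes:                            # Step 5
            break
        modes = new_modes
    return labels, modes

# Hypothetical data: six objects with three categorical attributes each.
objects = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes = k_modes(objects, [objects[0], objects[3]])
```

    The result depends on the initial modes, so in practice the procedure is often run from several starting points and the lowest-cost clustering is kept.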


    REFERENCES

    Abstract: International Journal of Computer Applications (0975-8887), Volume 12, No. 1, December 2010

    Data Mining Concepts: Micheline Kamber


    THANK YOU