ontology extraction by collaborative tagging with social ...ymatsuo.com/papers/ · and cyworld is...

Ontology Extractionby Collaborative Tagging with Social Networking

Masahiro HamasakiAdvanced Industrial Science

and Technology (AIST)1-1-1 Umezono, Tsukuba

Ibaraki, Japan

[email protected]

Yutaka MatsuoAdvanced Industrial Science


Ibaraki, Japan

[email protected]

Takuichi NishimuraAdvanced Industrial Science


Ibaraki, Japan

taku [email protected]

Hideaki TakedaNational Institute of

Informatics2-1-2 Hitotsubashi,

Chiyoda-kuTokyo, Japan

[email protected]

ABSTRACTThis paper proposes integration of a social network with col-laborative tagging for ontology extraction. Tripartite mod-els of emergent ontologies based on three dimensions (i.e.users, tags, and instances) have been proposed by severalresearchers, but we integrate another important dimension:user-user relations, such as the friend relation in social net-working services and a knows relation in Friend-Of-A-Friend(FOAF) documents. Because a lightweight ontology is aminimum commitment shared within a community, who com-municates with whom is an important source of informationthat can be used to improve the emergent ontology. We alsodiscuss the advanced model in where each concept in eachcommunity is considered different (and called p-concept),and show the possibility of using this model to resolve thepolysemy/hononymy problem. Two case studies using ouralgorithms are shown: we analyze tagging and social net-working data from an academic conference support systemPOLYPHONET and also from an advanced social systemcalled Blue Dot. We evaluate the extracted ontologies forinformation recommendation, and show that our algorithmworks better than others.

Keywordsemergent ontology, social network, collaborative tagging

1. INTRODUCTIONThe World Wide Web is being increasingly integrated into

our everyday life by social media like blogs and social net-work services (SNSs). Information technology thus needs toprocess not only formal contents with a clear vocabulary,such as research papers and news articles, but also contentswritten in the ever-changing languages and ontologies we usein our daily life. This increases the importance of emergentsemantics, which tries to obtain ontologies and taxonomiesfrom a vast amount of information [22].

The recent spread of collaborative tagging and SNSs re-flects our social life. In Web applications such as del.icio.us,Flickr, CiteULike, and YouTube1, users manage, share, and

1del.icio.us, www.flickr.com, www.citeulike.org, andwww.youtube.com.

Copyright is held by the author/owner(s).WWW2008, April 21–25, 2008, Beijing, China..

browse collections of online resources with semantically mean-ingful information in freely chosen text labels, or tags [5].In such applications we see the emergence of loose catego-rization systems, folksonomies, that can be effectively usedto navigate through a large and heterogeneous body of re-sources. Collaborative tagging and emergent semantics aredescribed in the literature (e.g., [5, 9]) and are discussedonline (e.g., [11, 15]).

Social networking services are sometimes characterized associal systems on the so-called Web 2.0. MySpace, Facebook,and LinkedIN are among the most popular SNSs, but the se-lection of an SNS can also reflect cultural diversity. Orkut,for example, is popular in Brazil, mixi is popular in Japanand Cyworld is popular in Korea. In the context of the Se-mantic Web, a social network is recognized as an importantmechanism for extablishing the “web of trust” needed thecredibility and trustworthiness [8, 14]. As social networksoverwhelmingly influence our lives, many applications havepotential applicability of social networks [23].

The relation between social network and the emergence ofontologies in collaborative tagging is discussed in literature.Gruber defines an ontology as an explicit specification of aconceptualization [10]. It is used to describe the ontologi-cal commitments of a set of agents in a community, so thatthey can communicate about a domain of discourse withoutoperating on a globally shared theory. Mika discussing theways that a community evolves and its commitments changebecause its members are leaving and entering, pointed outthat to change ontologies we need a method for extractingan ontology from the large number of individual interactions[20]. That is, we need a scalable and easily maintainable Se-mantic Web. Mika’s model takes into consideration threeclasses of elements: actors, concepts, and instances. Foldingtheir ternary relations into binary relations, we can obtaintwo ontologies: concept relations based on instance over-laps, and concept relations based on actor overlap. Severalother studies have taken similar approaches using tripartitemodels.

Though the Mika model is an elegant model of actor-concept-instance relations, we can extend it. An importantfactor we treat in this paper is the actor-to-actor relation.Though it has been previously addressed in the context ofsocial networks, the network in the Mika model is an af-filiation network (or a two-mode network) of the relationsbetween actors and concepts. The affiliation network canbe folded to generate the association of concepts in terms

of overlapping instances or concepts. We integrate the Mikamodel with another kind of social networks called adjacentnetworks, which are the social networks one usually has inmind when thinking of the virtual communities based onSNSs or Friend-Of-A-Friend (FOAF) aggregation.

An adjacent network, where each tie represents a rela-tion between actors – such as knows, collaborates with, oris friends with – can enhance the Mika model. Our model,called the Homophilic Actor-Mediated Activation (HAMA)model, provides an emergent ontology that takes social re-lations into account. Intuitively, if a friend of mine tags aWeb page, that information is useful for me and it is rea-sonable to expect me to commit to the tag to some degree.This expectation depends on a lightweight ontology being aminimum commitment shared within a community. So I canshare my ontology with my friends and colleagues throughdaily communication.

We also argue the polysemy/homonymy problem: due tothe difference between the different ontologies in differentcommunities. To provide a general view, we consider theconcepts labeled by each community to be different even ifthey have the same label. In our definition, each one is calleda p-concept and has a completely personal meaning. We as-sume that every tag is first a p-concept, that two p-conceptsare merged into a shared one through social interaction, andthat merged p-concepts can in some cases eventually form aglobally shared concept. The key is again a social network;a shared concept is created through social relations.

The HAMA model improves extracted emergent ontolo-gies in that it alleviates the data sparsity problem by inte-grating social relations and that it provides a possible res-olution of the polysemy/homonymy problem by p-conceptgathering. It is applicable to social systems that provideboth social networking and social bookmarking functions.Facebook, flickr, and Digg are such systems. Among them,Blue Dot is one of the advanced systems that elaborately in-tegrates a social network with collaborative tagging, thus weanalyze the data. We analyze the data on POLYPHONET[17], which is a social networking extraction system for aca-demic conferences (social networking and bookmarking forpresentations). In this paper we show a case study using ouralgorithms on POLYPHONET and Blue Dot. We evaluatethe extracted ontologies for information recommendation.

Though the HAMA model is not broadly applicable tomuch of the data currently online, i t shows the potentialeffectiveness of the integration of collaborative tagging andsocial networking and of integrating FOAF data with otherSemantic Web metadata.

The contributions of this paper can be summarized asfollows:

• We propose a new method to extract lightweight on-tology (keywords association). By integrating socialrelations with collaborative tagging, it can extract on-tologies that have high coverage.

• We apply the algorithm to an academic conferencedata, describe the findings on the data, and show theimproved result for ontology extraction for informationrecommendation.

• Our approach shows the potential importance of theintegration of social networking data with collabora-tive tagging data.

This paper is organized as follows: the next section ex-plains the Mika model and some problems in collaborativetagging systems. Sections 3 and 4 explain the HAMA model,and Section 5 shows two case studies using the model. Theevaluation is made in Section 6. After the discussion in Sec-tion 7, we mention related work in Section 8 conclude thepaper with a brief summary in Section 9.

2. BACKGROUNDTagging data has often been handled by taking a unified

user-tag-resource (or actor-concept-instance) approach [9,13, 12] in which each tag by each user to each instance is con-sidered as a single instance of the form (tag, user, resource).In this section we introduce the work by Peter Mika becauseit is well described with relation to the Semantic Web.

2.1 Mika model: tripartite model of folkson-omy

Tagging data can be considered as if it were a hyper-graph. The set of vertices is then partitioned into the threedisjoint sets A = {a1, . . . , ak}, C = {c1, . . . , cl}, and I ={i1, . . . , im} respectively corresponding to the set of actors(users), the set of concepts (tags, keywords), and the setof instances (URLs, photos, movies, etc.). Users tag ob-jects with concepts, creating ternary associations betweenthe user, the concept, and the object. Collaborative taggingwould then be defined by a set of annotations T ⊆ A×C×Iconsisting a hyperedge in the hypergraph.

This tripartite hypergraph can be reduced to three (two-mode) graphs, one for each pair of the three elements:

Concept-instance graph (CI) It represents the relation of con-cepts by measuring instance overlaps, thus constituting alightweight ontology Oic.

2

Actor-concept graph (AC) It shows how actors are relatedby sharing the same concepts, thus constituting a lightweightontology Oac.

Actor-instance graph (AI) It shows how actors are related byoverlapping instances. We also get a network of instances.

The ontologies Oci and Oac are formularized as follows:We denote a matrix representing the relation of actors andconcepts as Bac = {bij} where bij = 1 if actor ai is affiliatedwith concept cj . Then it can be folded into a lightweightontology of concepts based on overlapping sets of communi-ties: Oac = BT

acBac. And the matrix Bci where bij = 1 ifconcept ci is affiliated with instance ij can be folded into alightweight ontology of concepts based on overlapping setsof instances: Oci = BciB

Tci.

2.2 Problems in collaborative taggingGolder describes three problems in tagging systems: poly-

semy, synonymy, and basic level variation [9]. A polysemousword is one that has many related meanings. “Window,” forexample, may refer to a hole in the wall or to the pane ofglass in that hole. Polysemy is similar to homonymy, wherea word has multiple unrelated meanings (a “pool,” for ex-ample, can be a collection of water, a game played on a felttable, or a betting group). Either polysemy or homonymy

2In the original paper it is called Oci, but here we change thenotation for clarity; Oic is an ontology for instance-basedconcept relations and Oac is an ontology for actor-basedconcept relations.

can degrade the correctness of an ontology, retrieval per-formance, and the efficiency of information sharing. Be-cause it is sometimes difficult to distinguish polysemy fromhomonymy, we do not distinguish them and use the termpolysemy whenever a word has multiple meanings.

Wu et al. propose a solution to the polysemy problem[24]. They use a probabilistic generative model to model theuser’s annotation behavior and to automatically derive theemergent semantics of tags. Synonymous tags are groupedtogether and highly ambiguous tags are identified and sep-arated.

Popular instances have lots of tags and thus for them thereis little problem of synonymy for navigation purpose. Forexample, a Web page might be tagged as “San Francisco,”“san francisco,” or “SF”; collectively we can guess they arerelated. For less popular pages, however, tags are not suf-ficient for efficient navigation, especially when a system isnewly launched or a user has narrowly focused interests.Some studies have tried to deal with this problem by clus-tering tags [3].

3. HAMA MODEL: INTEGRATION OF AD-JACENT NETWORK

We extend the Mika model by integrating a social net-work among actors. We call our model the HomophilicActor-Mediated Activation (HAMA) model. The conceptsof a user are activated through their social relation, and themodel is effective if the social relations are between similar(homophilic) users.

Let us assume we have a social network among users andthat we have the corresponding adjacent matrix. We denotea k × k matrix that represents relations between actors asA = {aij}, and each element aij is binary. If we take theproduct of A and a k × l matrix Bac, the resultant matrixABac is a k × l matrix. This matrix sums up the columnsof Bac corresponding to the neighbors of each actor. Theconcept-concept relations are then obtained by substitutingABac for Bac. That is, O′

ac = (ABac)T ABac. This ontol-

ogy represents the relations between concepts as measuredby common actors considering their neighbors in a socialnetwork.

Similarly, if we have some similarity measures of instancesrepresented by I, we can obtain the refined ontology O′

ic =(IBic)

T IBic. There are several other ways that actor rela-tions could be included in the Mika model. Therefore, weassume we have matrix D derived from matrix A. It can bea power of A, a probabilistic distance in A, or any distancemetric in A. We can state our model as follows:

¶ ³Definition 1 Depending on adjacency matrix A, wedefine some distance matrix D. Then the ontology Orepresenting the relevance among concepts is calculatedas

O = λ(DBac)T DBac + (1− λ)(Bic)

T Bic,

where λ is a constant.µ ´

Because our model is general, we can test several settingsof k1, k2, and distance matrix D. For example, if we increaseλ we have more weight on the actor overlap. In this case,the relevance of concepts is based on the contexts of actors

Figure 1: Illustration of two ontologies

(e.g., their geographical context). On the other hand, if wedecrease the weight of λ, the similar concepts are calculatedclose. Especially, if we make λ greater than 1 the resultantrelevance is a good approximation of synonyms. The Mikamodel is equivalent to understood as D = I (a unit matrix)and either λ = 0 or λ = 1.

From the matrix point of view, data sparsity is a lack ofelements in the matrix: The elements where the value shouldbe greater than zero may be zero because there is no data.In the HAMA model the zero element is the sum of valuedadded by the other elements if they are connected by matrixA. In our formulae, O′

ac has fewer zero elements than Oac

does and O′ic has fewer zero elements than Oic does. This

is illustrated in Figure 1.The effectiveness, however, depends on D; sometimes the

matrix D improves Oac, while other times it degrades Oac.Because the assumption is that if two actors are in closerelationship, the ontology would be similar. In other word, ifmatrix D is given, there are concepts that can be effectivelytransferred to neighbors and there are concepts that cannot.

4. EXTENDED HAMA MODEL: PERSONAL-CONCEPT-BASED ONTOLOGY

4.1 P-conceptFrom the matrix point of view, the polysemy problem can

be considered as follows. There is only one column (or row)where there should be two or more. The multiple meaningsare shrunk to one column. To resolve them, we have to sepa-rate the proper column into multiple columns. For example,we should have “friend1” and “friend2” instead of “friend”if there are two meanings for the tag. Each tag is classifiedinto either of them from the context.

This approach, however, is rather conventional. Becausecollaborative tagging provides very different views of cate-gorizations, there are no “true” categorizations. There areonly personal and social benefits of categorizations. Someresearchers found two categories of tags: those only for per-sonal use, and those sharable or intended for sharing [9,1]. Some tags have wholly personal meanings and othershave socially shared meanings. This dichotomy is inevitablyvague, and it is reasonable to think the personal meaning isgradually shared by neighbors and eventually shared society-wide. In this sense, we take a completely different approachto the polysemy problem. Cuttuto et al. also discusses that

the emergence of a folksonomy exhibits dynamic aspects alsoobserved in the evolution of human languages, such as thecrystallization of naming conventions, competition betweenterms, and takeovers by neologisms [5].

A tag has two aspects: personal and social. It helps usunderstand this phenomenon when we recall that Saussuredivided language into langue and parole. About 90 yearsago, Ferdinand de Saussure wrote an influential book ti-tled “Cours de linguistique g’en’erale,” (Course in GeneralLinguistics) that was published posthumously in 1916. Init Saussure focuses on what we call language, “a system ofsigns that express ideas,” and suggests that it might be di-vided into two components: langue and parole. Langue isa sign system generated by people socially, and parole is asign system specific to each person. Langue is generated bythe action of a collective intelligence. Each individual per-son generates langue through parole but cannot control it.Parole is prescribed by langue, but each individual persongenerates parole freely. Parole is represented by each indi-vidual person’s original combination of concepts and her/his“phonation.”

Langue and parole effect each other. Langue prescribesthe generation of parole, but generated parole causes langueto change. We see a similar relation between folksonomyand tags in collaborative tagging. Every user makes a com-bination of concepts and labels freely and uses them. Some-one uses “friends” for her/his friends, and someone else uses“friends” as the label for a television program. On the otherhand, every user is effected by folksonomy. Once tagging be-comes stable, users cannot avoid making tags following therule of the system because such tags are more useful to himand others.

Inspired by the Saussure’s two components langue andparole, we propose a new way to handle the user-tag-itemsystem. The idea is to initially consider each tag by eachactor as unique and then gradually relate them. 3

We consider the tag “friend” used by Alice different fromthe tag “friend” used by Becky. Only when we have enoughproof that the two tags are assigned to the same instances,or that the two have some proximity in communication, dowe think that the two tags are the same. We call each tagused by each person a p-concept4.

4.2 HAMA model based on the p-conceptWe extend the HAMA model by including p-concepts. If

we have k persons and l concepts, we could have as manyas k × l p-concepts. Therefore the HAMA model using p-concepts would have much larger matrices. We denote amatrix representing the relation of actors and p-concepts asPap = {pij} where pij = 1 if actor i is affiliated with p-concept j. We also define the matrix Pip = {pij} wherepij = 1 if p-concept i is affiliated with instance j. Definition1 is then extended with p-concepts as follows.

3In speech, people use various context-based combinationsof concepts and phonations. Someone might on one occa-sion use the phonation “friends” for acquaintances and onanother occasion use it as the name of television program.However, users use a tag which is regarded as phonation atcollaborative tagging to categorize contents. Consequentlyit is natural that each individual user has only one conceptfor each tag.4The p-concept itself is intentionally polysemous and canmean primitive concept, parole-like concept, or personalconcept.

¶ ³1. Put separately p-concepts. There can be at most

l × k p-concepts.

2. Merge concepts: Merge two p-concepts if

• their labels are the same and

• they share the same instances, or neighbor-ing actors

and regard the cluster of p-concepts as a singleconcept.

3. Make edges: Add an edge between two p-concepts if

• they share the same instances,

• they share the same actors, or

• they share the neighboring actors.

µ ´Figure 2: P-concept gathering.

¶ ³Definition 2 The ontology O representing therelevance of p-concepts is calculated as O′

pp =k1(DPap)T (DPap) + k2(Pip)(Pip)T , where k1 and k2

is a nonnegative constant.

µ ´The matrix O represents the relations between p-concepts inconsideration of all the effects of user-user, user-p-concept,and instance-p-concept relations.

Theoretically we can calculate the matrices, but O′pp is a

kl × kl matrix. So it is huge and not efficient to calculatematrix-wise even for a small number of users and tags. Weinstead use a procedural algorithm called p-concept gather-ing to calculate the ontology as shown in Figure 2. We firstassume that every tag is a p-concept and then merge twop-concepts into a shared one by considering overlap of in-stances and actors as well as the relations of actors. Finallythe algorithm forms socially-shared concepts. The algorithmin Figure 2 is very simple and can be improved in severalways. We can set thresholds on the number of shared ac-tors, instances, and neighboring actors to merge conceptsand make edges. Both Merge concepts and Make edges arebased on similarity among p-concepts using the number ofshared actors, instances, and neighboring actors.

4.3 Polysemy detectionP-concepts are merged through the communication of two

actors, but we cannot detect all the communication betweenactors even if we have lots of online data sources available.Not all the p-concepts will go through social channels. Be-cause the current technology still cannot find the enoughsocial ties, in this paper we show polysemy detection ratherthan the continuous degrees between personal and socialmeanings.

We first find the community in the social network. We usethe Newman algorithm [21] to find communities from a so-cial network. Recently, a series of effective graph-clusteringmethods has been advanced. Pioneering work that specifi-cally emphasizes edge betweenness was done by Girvan andNewman [7]. The betweenness of an edge is the number

of shortest paths between pairs of nodes that run along it.Girvan et al. [7] show that this clustering works well in var-ious networks, from biological networks to social networks.Numerous studies have been inspired by that work. Oneprominent effort is a faster variant of the Girvan-Newmanalgorithm [21], which in this paper we call Newman cluster-ing.

After obtaining communities, a concept is investigated tosee whether it conveys a different meaning. The conceptsrelated to a target concept are obtained. If the related con-cepts differ depending on community, it is likely that thetarget concept has different meanings in a sense that its as-sociations to other concepts are different. In this way, wecan detect polysemous words. Concrete examples are shownin the next section.

5. CASE STUDY OF EMERGENT ONTOL-OGY

This section describes case studies in which our modelwas applied to Polyphonet, an academic conference supportsystem, and Blue Dot, a commercial system providing socialnetworking and social bookmarking.

5.1 Emergent ontology in an academic confer-ence

Polyphonet Conference (hereafter, Polyphonet) is a com-munity support system that is tailored to academic con-ferences and that is partly based on social network miningfrom the Web [17, 16]. It has functions for social network-ing among conference participants and for the social book-marking of presentations. A user can bookmark interestingconference items (papers, demos, and posters) and explorerecommended presentations and other researchers. Poly-phonet has been used to promote participants’ communica-tion at several conferences: the 17th through the 21st annualconferences of the Japan Society of Artificial Intelligence(JSAI2003-JSAI2007) and the International Conference onUbiquitous Computing (UbiComp2005, UbiComp2006). Morethan 500 participants attended each conference, and 200used the system. We use the data of JSAI2005 in this paperbecause the users there were more active than those at theother three conferences.

The numbers of bookmarking and social networking userswere respectively 75 and 323. The social network of theusers is shown in Fig. 4, where the number of edges is 1014,the characteristic path length L = 3.289, and the clusteringcoefficient C = 0.486. The most active user had 77 edges,while an average user had 6.27 edges. This means that thenetwork had the characteristic of a small world. The coremembers were densely connected in the network. If we applythe Newman clustering, we can get 13 clusters. The biggestcluster has 91 social networking users and top six clustersinclude 90% of users.

We had 314 tags for the 297 papers presented at this con-ference. The most frequent tags are listed in Table 1. On theaverage, a tag was used by 11.6 users for 3.1 presentations.

5.1.1 Examples of ontologiesWe show some examples for emergent ontologies for an

academic conference. The ontologies represent the relationsbetween concepts in the papers presented at the conference.Ontologies Oic and Oac are shown in Figures 5 and 6. The

Figure 3: Screenshot of Polyphonet Conference

Table 1: Frequent tags on Polyphonet.tag users instances records

individual 37 10 125content 37 9 59

extraction 36 14 56communication 35 9 82environment 35 15 56

network 35 7 54concept 34 10 48image 33 7 45search 33 8 60

knowledge 31 16 57

thresholds are determined so that the numbers of edges arethe same in both ontologies (100 edges). Though the twoontologies are not easy to compare, there are some differ-ences between them. For example, in Oic we can find noconnection between the concepts “gene” and ”interaction”but in Oac they are connected. JSAI2005 had a session re-lated to artificial life, and that session referred to “gene”and “interaction.” This means that hidden relevance can beclarified by considering common actors.

The ontology obtained by the HAMA model (Figure 7)actually consists of two ontologies Oic and O′ac. It is moreevenly distributed. The concepts can be related more closelyby considering friends and acquaintances information. Thisshows that the HAMA model is effective against the datasparsity problem.

We investigated network centralization by using Bonacicheigenvector centrality [4] as a measurement. High central-ization means that the network is has a star-like shape, witha large difference between the nodes with the highest andlowest centrality. The centralization of Oac is 103.04, whichis lower than that of Oic (51.11). This can be interpreted asindicating that general or ambiguous words appear in var-ious contexts in documents and thus have many relationswith other words. The centrality of our HAMA model is

Figure 4: Social network in Polyphonet (353 nodesand 1014 edges).

Figure 5: Concept relations in Polyphonet revealedby concept-instance relations (Oic).

even lower than that of Oac (45.8), but we found the dif-ference between them from Figure 6 and 7. The network ofthe HAMA model formed clusters of words. This means thatconcepts are merged into clusters based on communities inthe social network.

5.1.2 Result of p-concept gathering and polysemydetection

We have 314 tags for the 297 papers presented at thisconference. here were 104,841 p-concepts representing tagsby different users (Figure 2-1). The average user had 105.4p-concepts, and the user with the most p-concepts had 184.

As we apply p-concept gathering described in section 4.2,the concepts emerge as a cluster of p-concepts emerges. Be-ginning with 6222 p-concepts, we would get 3978 concepts ifwe merged neighboring p-concepts (Figure 2-2) and wouldget 536 concepts if we merged them according to common in-stances. As we used both merging by neighbors and mergingby common instances, we got 474 concepts, which is similarto the number of original keywords (314).

Let us examine the p-concept “agent.” Fifty users hadthis p-concept and their p-concepts were merged into fiveconcepts. The biggest one had 47 users, the second largesthad six users, and the others had only one user. The biggest

Figure 6: Concept relations in Polyphonet revealedby actor-concept relations (Oac).

Figure 7: Concept relations in Polyphonet revealedby the HAMA model.

indicates “agent” as a program with a function to interactwith users. The second indicates “agent” in the contextof machine learning. If we examine the p-concept “game,”We find that twelve users had it and theirs were mergedinto two concepts: game as game theory and game as anentertainment.

Next, we applied the polysemy detection described in sec-tion 4.3 to the same data. Table 2 shows the different co-occurrences in different communities of the words “agent”and “game.” The p-concepts which had the label “agent”were merged into five concepts when we applied p-conceptgathering. In polysemy detection, the concept “agent” wasfound in six communities. In cluster2 and cluster8, we caninfer that it indicates a program for processing Web con-tents. In cluster5 and cluster6, we can infer that it indicatesa robot that works collaboratively. A community generatedby the clustering for polysemy detection is a user’s commu-nity based on social networks. It is natural that the sameconcepts are used in different communities (e.g., in cluster2and cluster8 and in cluster5 and cluster6).

It seems that the concept “agent” in cluster2, cluster5,cluster6, and cluster8 indicates the concept “agent” in thebiggest merged concept on p-concept gathering: as a pro-gram with a function to interact with users. From this re-sult, we found that our concept merge method in p-concept

Table 2: List of polysemy tags that have differentco-occurrences in communities in Polyphonet.

tag cluster co-occurrence tagsagent cluster2 movement

cluster3 news, CMS, keywordcluster5 architecture, robot, statuscluster6 architecture, collaborative workcluster7 eyes, human, hypothesiscluster8 category, page

game cluster2 learning, entity, degreecluster5 behavior, prisoner, simulationcluster6 a sense of incongruity,

detection, learning

Pajek

Figure 8: Social network in Blue Dot.

gathering is too rough. We need to improve the method formerging them gradually. Anyhow, we found that a com-munity based on social networks can highlight polysemouswords.

The polysemy problem is deeply related to the correctnessof an ontology. It may degrade the correctness of ontol-ogy, retrieval performance, and the efficiency of informationsharing. From the experimental results, we found that so-cial networks can help to find polysemous tags. This is oneof the potential important benefits of integrating social net-working data with collaborative tagging data..

5.2 Emergent ontology in Blue DotWe have another dataset showing the effectiveness of our

approach. Blue Dot is a service that was launched in July2006. Intended to use social networks for sharing digital in-formation, it allows a user to track what the user’s friends inthe social network are bookmarking or “dotting.” BecauseBlue Dot is one of the most advanced systems that pro-vide social networking and social bookmarking functions,we choose it as a good testbed for the HAMA model andp-concept gathering.

The data we used consists of 964 users with 28018 differ-ent tags for 76237 URLs. The numbers of users of socialnetworking and social bookmarking were respectively 2981and 964. The social network is shown in Fig. 8. The mostactive user in the network had 393 edges, while an averageuser had 6.12 edges5. L = 3.9122 and C = 0.3129.

The most frequent tags are listed in Table 3. We can

5There were 869 unidirectional edges, but we consideredthem to be nondirectional.

Table 3: Frequent tags in Blue Dot.tag users instances records

news 616 11827 13505music 421 3545 4019blogs 394 3382 3791video 352 3599 4191movies 303 1585 2023

shopping 300 2138 2433books 264 1533 1810food 208 1064 1348art 170 1403 1549

funny 166 1509 1698

Table 4: List of tags highly correlated with sharingof social relation.

tag χ2 persons densityrestaurant 1083.7 49 0.238

jon+stewart 791.0 13 0.727bdmentions 742.3 15 0.604

katrina 726.4 26 0.337global warming 720.0 17 0.525

see that tags related to hobbies (e.g., “music”, “video”, and“books”) were used more frequently in Blue Dot than inPolyphonet. Table 4 lists tags highly correlated with thesharing of social relation. χ2 indicates the tag’s correlationwith sharing of social relations, persons is the number ofusers who have a target tag, and density is the density ofsocial networks that consists of users who have a target tag.We can infer the rough context of social networks on BlueDot. Some keywords related to locality (e.g., “restaurant”and “Katrina”) are listed in Table 4. From this result, wecan infer that many of the social networks in Blue Dot arebased on locality.

Figures 9, 10, and 11 show the extracted ontologies Oic,Oac, and O. We can see that the Oic has distributed clusters.If we consider the actor dimension by Oac, it is more con-nected than the Oic. If we consider the social relations be-tween users, all the concepts are related in some ways. Thistendency is similar to that seen in the examples of emergentontology in Polyphonet. This shows the effectiveness of theHAMA model because the hidden relations behind conceptsare obtained.

6. EVALUATION THROUGH RECOMMEN-DATION

We used a task-based evaluation to compare ontologies.An extracted lightweight ontology can be used for informa-tion recommendation. For users who tag one paper, we canrecommend other papers based on an ontology. If one on-tology is better than another, it will produce higher rate ofacceptance than the other does. Assumed hypothesis is: auser will have a higher probability of accepting papers rec-ommended on the basis of instances with ontological prox-imity.

We gathered the tagging data of 75 users and simulatedwhether a user will accept a recommended instance by on-tological similarity. It is a kind of cross-validation, where atraining set and a test set is separately used. We measure theontological similarity from the training set, and then providerecommended instances to each user. If a user checked the

Figure 9: Concept relations in Blue Dot revealed byconcept-instance relations (Oic).

Figure 10: Concept relations in Blue Dot revealedby actor-concept relations (Oac).

instance in the test set, the recommendation was consideredcorrect; otherwise it was considered incorrect. Though usersdid not tag all the instances they might have been interestedin, this figure is an approximation of the effectiveness of anontology.

The results are shown in Fig. 13 and Fig. 14. Figure 13plots average precision of the recommendation against theamount of training data. If we use more data, the precisiondecreases because the number of correct answers in the testset decreases. The precisions are almost same for all threeontologies, but with the smallest amount of training datathe precisions for Oac and OHAMA are a little higher thanthe precision of Oai. We must tune some parameters andthresholds when we use these methods in recommendationservices. In this experiment, the system recommends allinstances that have ontological proximity r which is greaterthan zero. For that reason, all precisions are low.

Figure 14 plots the average recall of the recommendationagainst the amount of the training data. It shows that therecalls of Oac and OHAMA are higher than that of Oai. Es-pecially, the recall of Oac with the HAMA model is higherthan that of the others when there is little training data.This result indicates the effectiveness of the HAMA modelagainst the data sparsity problem.

In summary, precision is almost the same for all three

Figure 11: Concept relations in Blue Dot revealedby the HAMA model (O).

ontologies and the recall of OHAMA is better than that ofthe other ontologies. This result indicates that our proposedmethod can extract ontologies more exhaustively, which cor-responds to the results reported in the relevant literature:Bao showed that keyword associations based on social anno-tations can improve information retrieval [2]. Our ontologyOHAMA will help increase information retrieval recall.

Our proposed model is better than other models when lit-tle training data is available, and data sparsity is generallyone of the biggest problems in the early stages of social ap-plications. The result indicates the potential importance ofthe integration of social networks for this problem.

7. DISCUSSIONWe can see that the precision of Oic sometimes is higher

than that of Oac and OHAMA. In this experiment, we setan extreme parameter and simple distance matrix in orderto show the differences between ontologies clearly. Precisioncan be improved by adjusting this parameter and this ma-trix. Our proposed model is better than others when littletraining data is used, and Oic and Oac are more effectivewhen there is a lot of training data. It seems that reducingλ and the difference between DBac and Bac and increasingthe amount of training data is effective. On the contrary, ifwe need effectiveness of social relations more and more, wecan generate matrix D in several other ways, e.g., Distant-based: We make D so that element (i, j) is 1/dij wheredij is the distance between actor i and j. Propagation:We apply the neighbor matrix multiple times. The resulantmatrix means the probabilistic distance from one actor toanother actor.

We would like to discuss the reasons that the precisionof our model is not high compared to the recall. We ex-plained in section 3 that our model is based on the fact thatontology (keyword association) is deeply related to social re-lations. There are, however, many kind of social relations.If the context of social relations differs from the context ofan application, matrix D deteriorates Oac. In this paperthe application was a information recommendation of anacademic paper. We infer that coauthor relationships andcolleague relationships are desirable. It is likely that mostsocial relations in Polyphonet are similar to them becausethe system is one for academic conference assistance. Users

¶ ³1. Prepare a training dataset of annotations Taci and

distance matrix Daa.

2. Make two matrixes Bac and Bic from Taci and threetypes of ontologies:

• Oac = BTacBac

• Oci = BciBTci

• Ohama = λ(DBac)T DBac + (1− λ)(Bic)T Bic

In this experiment, we set λ = 1.0 in order to showthe differences between ontologies clearly.

3. Prepare a test dataset of annotations T′aci and maketwo matrixes: B′ac representing the relation of ac-tors and concepts, and B′ai where b′ij = 1 if actor ai

tagged instance ij .

4. Calculate ontological proximity r of actors and in-stances

b′a1c1∈ B′ac, bi1c1 ∈ Bic, oc1c2 ∈ Oic/ac/hama,

raiij =X

cx∈B′ac

X

cy∈Bic

b′aicxocxcy bijcy

5. Recommend instances to actors whose ontologicalproximity with them is higher than the thresholdvalue.

6. Evaluate the results of recommendation. If instanceij was recommended to actor ai, and

• bij = 1, bij ∈ B′ai, the recommendation wascorrect

• bij = 0, bij ∈ B′ai, the recommendation wasincorrect

µ ´Figure 12: Process of Evaluation.

can register their own social networks freely, however, sothere are some social relations that are not related to thecontext. It is possible that such relations become a sourceof noise for the application and reduce the precision.

For example, when an application is a service for recom-mending a restaurant or an amusement park, it is desirablethat social relations are based on local communities. It isimportant to consider a relationship among a task and ashared ontology on social relations. Our proposed methoddescribed at section 4.3 is one way to solve this problem. Ta-ble 2 shows that “game” has different co-occurrences in dif-ferent communities. If in some community the co-occurrencewords of “game” are “XBOX” or “Nintendo,” that commu-nity will probably not be a suitable one for the recommen-dation of academic papers.

8. RELATED WORKSIn information retrieval, word clustering (or LSI, random

projection) will sometimes improve the results because itmerges columns of a word-to-document matrix. making thematrix smaller. This merging Improves text classificationand increases retrieval performance.

We should point out that the HAMA model is effectivewhen the concept usage of a person is similar to that ofthe person’s neighbors. If not, the resultant matrix may

Figure 13: Precision of information recommenda-tion based on Polyphonet data.

Figure 14: Recall of information recommendationbased on Polyphonet data.

be inferior to the original one. However, when we take thematrix by friends or colleagues data, this seems to work wellbecause of the very nature of emergent ontology: lightweightontology is the mutual understanding among members.

Our work is generally based on the work of Mika [19],which we described in detail in this paper. Several re-searchers have investigated ways to extract word associationand general knowledge and ontologies from the Web. Cimi-ano and his colleagues developed a system, called Pattern-based ANnotation through Knowledge On the Web (PANKOW),that which assigns a named entity into several linguisticpatterns that convey semantic meanings [6]. Ontologicalrelations between instances and concepts are identified bysending queries to a Google API based on a pattern library.

In recent years, several studies analyze actual tag data.Golder analyzed the structure and dynamic aspects of col-laborative tagging systems [9]. They found regularities inuser activity, tag frequencies, and bursts of popularity inbookmarking in del.icio.us data. Lambiotte, like Mika, pro-posed a tripartite network but used it to analyze the struc-ture of the network for Last.fm and CiteULike [12]. Cat-tuto et al. report a statistical analysis of tagging activityin del.icio.us and Connotea by introducing a new version ofYule-Simon model where “the rich get richer” [5]. Folkson-omy is a kind of experiment in collective intelligence, thehallmark of what is called Web 2.0 [18].

In our work we use a social network, and there are variousmethods that can be used to obtain social networks. Au-tomatic detection of relations is also possible from varioussources of online information such as e-mail archives, sched-ule data, and Web citation information. Friend-of-a-Friend(FOAF) provides a vocabulary that can be used to describeinformation on a person and that person’s relations to oth-ers. We can collect FOAF files and obtain a FOAF network[19]. With the increase of FOAF profiles and tagging data,our proposed method will have increasing applicability inthe future.

9. CONCLUSIONIn this paper we proposed integration of a social network

with a tripartite model of ontologies. The model, which wecall the HAMA model, is based on three dimensions (actors,concepts, and instances) and integrates another dimension:actor-actor relations. Inspired by Saussure’s contrast be-tween lang and parole, we also introduce p-concepts, whichare the concepts conceived by individuals before they areshared by many people. P-concepts are gathered and rec-ognized as a common concept by our algorithm. We showa case study in Polyphonet and evaluate our algorithm byapplying it to the recommendation of information.

Experimental results on two datasets, Polyphonet andBlue Dot, provide the improved result of ontology extractionfor information recommendation. An ontology extracted bythe HAMA model was, compared to ontologies extracted byprevious models, better with respect to coverage. To im-prove precision, we need to tune the parameters elaboratelyand carefully select social relations to use. Therefore, oneadvantages of the integration is toward data sparsity prob-lem.

Social bookmarking can be seen as a process of conceptarticulation for things. We believe that integrating com-munication with collaborative tagging will provide a clearerview of conceptualization on the Web. We hope our studywill contribute to various Semantic Web studies by bringinga new dimension to emergent ontology.

10. ADDITIONAL AUTHORS

11. REFERENCES[1] M. Ames and M. Naaman. Why we tag: Motivations

for annotation in mobile and online media. In Proc.CHI2007, 2007.

[2] S. Bao, X. Wu, B. Fei, G.-R. Xue, Z. Su, and Y. Yu.Optimizing web search using social annotations. InProc. WWW2007, 2007.

[3] G. Begelman, P. Keller, and F. Smadja. Automatedtag clustering: Improving search and exploration inthe tag space. In Collaborative Web TaggingWorkshop, WWW2006, 2006.

[4] P. Bonacich. Factoring and weighting approaches tostatus scores and clique identification. Journal ofMathematical Sociology, 2:113–120, 1972.

[5] C. Cattuto, V. Loreto, and L. Pietronero.Collaborative tagging and semiotic dynamics. 2006.

[6] P. Cimiano, G. Ladwig, and S. Staab. Gimme´ thecontext: Context-driven automatic semanticannotation with cpankow. In Proc. WWW 2005, 2005.

[7] M. Girvan and M. E. J. Newman. Communitystructure in social and biological networks.Proceedings of National Academy of Sciences USA,99:8271–8276, 2002.

[8] J. Golbeck and J. Hendler. Inferring trustrelationships in web-based social networks. ACMTransactions on Internet Technology, 7(1), 2005.

[9] S. Golder and B. A. Huberman. The structure ofcollaborative tagging systems. Journal of InformationScience, 2006.

[10] T. Gruber. Toward principles for the design ofontologies used for knowledge sharing. InInternational Journal of Human-Computer Studies,volume 43, pages 907–928, 1995.

[11] T. Gruber. Ontology of folksonomy: A mash-up ofapples and oranges, 2005. tomgruber.org.

[12] R. Lambiotte and M. Ausloos. Collaborative taggingas a tripartite network. 2005.http://www.citeulike.org/user/FG/article/484851.

[13] C. Marlow, M. Naaman, d. boyd, and M. Davis. Ht06,tagging paper, taxonomy, flickr, academic article,toread. In Proc. HyperText 2006, 2006.

[14] P. Massa and P. Avesani. Controversial users demandlocal trust metrics: an experimental study onepinions.com community. In Proc. AAAI-05, 2005.

[15] A. Mathes. Folksonomies – cooperative classificationand communication through shared metadata, 2005.www.adammathes.com.

[16] Y. Matsuo, M. Hamasaki, H. Takeda, J. Mori,D. Bollegala, Y. Nakamura, T. Nishimura, K. Hasida,and M. Ishizuka. Spinning multiple social networks forsemantic web. In Proc. AAAI-06, 2006.

[17] Y. Matsuo, J. Mori, M. Hamasaki, H. Takeda,T. Nishimura, K. Hasida, and M. Ishizuka.POLYPHONET: An advanced social networkextraction system. In Proc. WWW 2006, 2006.

[18] P. McFedries. Folk wisdom. IEEE Spectrum, 2006.

[19] P. Mika. Flink: Semantic web technology for theextraction and analysis of social networks. Journal ofWeb Semantics, 3(2), 2005.

[20] P. Mika. Ontologies are us: A unified model of socialnetworks and semantics. In Proc. ISWC2005, 2005.

[21] M. Newman. Fast algorithm for detecting communitystructure in networks. Phys. Rev. E, 69, 2004.

[22] S. Staab. Emergent semantics. IEEE InteligentSystems, 17(1):78–86, 2002.

[23] S. Staab, P. Domingos, P. Mika, J. Golbeck, L. Ding,T. Finin, A. Joshi, A. Nowak, and R. Vallacher. Socialnetworks applied. IEEE Intelligent Systems, pages80–93, 2005.

[24] X. Wu, L. Zhang, and Y. Yu. Exploring socialannotations for the semantic web. In Proc.WWW2006, 2006.

ontology extraction by collaborative tagging with social ...ymatsuo.com/papers/ · and cyworld is...

Documents