[ieee education (iccse) - nanning, china (2009.07.25-2009.07.28)] 2009 4th international conference...
An Ontology-based Event Matching Dealing with Semantic Heterogeneity in Pub/Sub Systems
Wenting Si
dept. Computer Science and Technology
Tongji University Shanghai, China
Xiaoping Xue
dept. Information and Communication Engineering
Tongji University Shanghai, China
Xiaoping Wang
dept. Computer Science and Technology
Tongji University Shanghai, China
Abstract—Pub/Sub systems are well suited to asynchronous information exchange in distributed systems. However, because numerous communication participants from diverse organizations interact under an anonymous communication mechanism, semantic heterogeneity in event matching is a fundamental issue: it affects both the accuracy and the comprehensiveness of event matching. Most existing content-based Pub/Sub systems are unable to support semantics. Therefore, to eliminate the semantic heterogeneity that arises in event matching, this paper introduces ontology. To remain compatible with the infrastructure of current content-based Pub/Sub systems, we first put forward an approach that converts the original content-based data model into a new concept-based data model by applying a multidimensional index (X-tree) to a high-level shared ontology library. We then propose a corresponding event-matching algorithm that uses the coverage among subscriptions and the matching order to accelerate matching. The experiment demonstrates that the algorithm can tackle the issue to some extent while also increasing the efficiency of event matching.
Index Terms— Pub/Sub System; Ontology; Event-matching; multidimensional-index
I. Introduction
Pub/Sub systems, a distributed system model consisting of event agents, publishers and subscribers, let participants exchange information in a loosely coupled way more efficiently than traditional distributed systems. Owing to their asynchronous, many-to-many communication, they are suitable for current large-scale Internet applications, grid computing and mobile computing. Pub/Sub systems are therefore widely used in a variety of applications, such as stock markets[1] and RSS feeds[2].
Pub/Sub system models are mainly classified into two types: subject-based and content-based. In content-based Pub/Sub systems (e.g. Siena[3], Elvin[4], Gryphon[5]), the content a subscriber is interested in is described by a sequence of constraint conditions. They match events more flexibly and comprehensively than subject-based systems (e.g. MQSeries[6], Scribe[7]), in which nothing about a subscription except its topic is available to publishers.
In such distributed systems, numerous publishers and subscribers from different organizations and applications exchange events under an anonymous communication mechanism; that is, the communication participants are unknown to each other. As a result, owing to cultural or linguistic differences, publishers and subscribers cannot agree on a single description of information, which induces semantic heterogeneity. For instance, 'film' and 'movie' are synonyms. In other cases the same word carries different meanings in different fields (e.g. 'notebook' in the computer field and in the stationery field). The event-matching mechanism is in charge of delivering each event to the subscribers who are interested in it. If this mechanism cannot eliminate semantic heterogeneity, the event agent in the Pub/Sub system will hinder the dissemination of events to interested subscribers. The semantic issue in event matching is therefore fundamental.
Most existing content-based Pub/Sub systems do not support semantic event matching, yet they are widely accepted and used. We propose using ontology to share a common vocabulary and its relationships, so that the system better understands the mutual interests of publishers and subscribers. In addition, we put forward a new multidimensional-index event-matching mechanism that relies on the data model of typical existing content-based systems, so that matching is efficient and ontology-based while existing systems remain supported.
This paper is organized as follows. Section 2 presents related work and introduces current event-matching algorithms. Section 3 describes the ontology-based approach. Section 4 presents the results and analysis of the experiment that implements the event-matching algorithm. Section 5 presents the conclusions and future work.
II. Related Work
In content-based Pub/Sub systems, the major event-matching algorithms focus on the interrelationships between subscriptions, such as cover[8] and merge[9], because exploiting the coverage among subscriptions can reduce both space complexity and time complexity.
Currently, most approaches in content-based systems apply a name-value data model and realize event matching mainly through predicate matching. In terms of matching sequence, they can be divided into direct and indirect predicate matching. The former strategy matches subscriptions described as a tree (graph) through depth-first search, such as the matching tree[10] and the binary decision diagram (BDD) algorithm[11]. The latter strategy implements matching by establishing a one-dimensional or multidimensional index structure that speeds up the location of the relevant subscription information, such as the counting algorithm and the clustered index algorithm[12,13]. To improve matching efficiency further, some work[14] focuses on the matching history and the matching order.
Proceedings of 2009 4th International Conference on Computer Science & Education
978-1-4244-3521-0/09/$25.00 ©2009 IEEE 1225
However, these name-value algorithms have a simple structure, which makes matching efficient but gives them poor expressive power. They are intrinsically unable to solve the semantic heterogeneity issue.
To enhance the system's expressive capacity, other matching methods (e.g. XML-based and Map-based) have been proposed. These likewise cannot fundamentally solve the problem of semantic understanding.
Therefore, to solve the problem of semantic heterogeneity in event matching, the following two aspects should be taken into account:
• Resolving the relationships among data (e.g. equivalence relations, inheritance relations). The system needs to capture complete, explicit semantic information by accurately defining the relationships between metadata. An ontology is a formal, explicit specification of shared concepts[15]. Its features (conceptualized, explicit, formalized and shared) provide a common understanding within a certain scope and thereby eliminate the ambiguity that arises across application environments. Using ontology therefore helps make the information exchanged between publishers and subscribers explicit.
• Resolving the data mapping. An ontology is a specification that describes the relationships between concepts, but without a mapping mechanism the participants cannot be aware of the relationships among the data (such as the synonyms of concept A or the derivatives of A, especially in context). A mapping mechanism therefore supports the semantic conversion that eliminates heterogeneity.
At present, a few semantic-based experimental prototypes have emerged to support semantic matching, such as OPS[16] from the Chinese Institute of Science and Technology and JTangPS[17] from Zhejiang University. These systems introduce ontology in the form of RDF[18], and event matching is based on RDF graph matching. The subscription and event models in these systems are complex. In addition, publishers and subscribers must know the relationships between the concepts precisely in order to publish or subscribe uniformly in RDF form, which is inconvenient and inflexible under asynchronous, decoupled communication.
Ontology is therefore a very good option for dealing with the semantic issue. At the same time, we want the semantic relationships to be transparent to publishers and subscribers, so that they can exchange information in the original way, as in SIENA[3].
III. An Effective Event-Matching Approach Supporting Semantic Heterogeneity
The data model of widely used existing content-based systems is simple: it is expressed as sets of (name, type, value) and (name, type, op, value) tuples. Because of this simplicity, communication participants can communicate without prior knowledge of how data relationships are defined. The event-matching approach proposed in this paper is therefore built on traditional content-based Pub/Sub systems.
When solving semantic heterogeneity, we must take into account the two aspects mentioned above (i.e. resolving the relationships among data and resolving the data mapping). Hence, the approach to supporting semantic heterogeneity consists of the following steps:
• Establish a common shared ontology library in which data are expressed as concepts. While establishing it, we define the relationships the ontology supports and the structure used to store the ontology, to ensure that it can be queried efficiently.
• Using the information in the ontology library, map the original data structure (name-value) to the new concept-based data structure.
• Provide an event-matching approach on the new data model that uses the coverage among subscriptions and constraint conditions.
A. Establish Common-shared Ontology
We establish the ontology library at a high level as an independent module. When event matching takes place, the event agent achieves semantic identification by querying the library module, without caring about the underlying infrastructure. Moreover, because the participants in the system come from various organizations, each ontology in the library is dedicated to one domain. We call it a domain ontology, and it is identified by a unique name.
Ontology has no unified formal definition[19], owing to the characteristics of different application environments. In this ontology library, an ontology is expressed as a 5-tuple (C, PC, R, F, A). C is the set of concepts and PC is the set of concept attributes. R is the set of relationships the ontology supports. Here we mainly define three relationships, DOM(R)={owl:equivalent, rdfs:subClassOf, owl:attributeOf}. F is the set of functions corresponding to these relationships:
• owl:equivalent(C1,C2) means that C1 and C2 are synonyms, so a subscriber who is interested in C1 is also interested in C2.
• rdfs:subClassOf(C1,C2) means that C2 is a subclass of C1, so a subscriber who is interested in C1 is also interested in its child C2 (e.g. a person who is interested in "computer" should also be interested in "notebook" and "desktop").
• owl:attributeOf(C1, Pc1) means that Pc1 is an attribute of C1.
A is a group of predicate-logic expressions that express the constraint rules of the ontology. Here we stipulate that the synonym relationship is transitive and that two concepts cannot inherit from each other.
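The three relationships and the two rules in A can be illustrated with a small sketch. This is not the authors' implementation: the class name `DomainOntology`, its method names and the example concepts are hypothetical, and the relations are stored as plain edge sets rather than in the X-tree based structure described later.

```python
from collections import defaultdict


class DomainOntology:
    """Toy domain ontology (hypothetical names, for illustration only)."""

    def __init__(self, name):
        self.name = name                    # unique name of the domain
        self.synonyms = defaultdict(set)    # undirected owl:equivalent edges
        self.children = defaultdict(set)    # rdfs:subClassOf edges, parent -> children
        self.attributes = defaultdict(set)  # owl:attributeOf, concept -> attributes

    def equivalent(self, c1, c2):
        """Record that c1 and c2 are synonyms."""
        self.synonyms[c1].add(c2)
        self.synonyms[c2].add(c1)

    def sub_class_of(self, parent, child):
        """Record that child is a subclass of parent, rejecting mutual inheritance."""
        if parent in self.interests(child):
            raise ValueError("concepts cannot inherit from each other")
        self.children[parent].add(child)

    def attribute_of(self, concept, attribute):
        """Record that attribute belongs to concept."""
        self.attributes[concept].add(attribute)

    def interests(self, concept):
        """Concepts implied by an interest in `concept`:
        synonyms (followed transitively) and all subclasses."""
        seen, stack = {concept}, [concept]
        while stack:
            current = stack.pop()
            for nxt in self.synonyms[current] | self.children[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
```

With `equivalent("computer", "PC")` and `sub_class_of("computer", "notebook")` recorded, `interests("PC")` contains "computer" and "notebook", which mirrors the "computer" example in the text.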
Fig1(a) shows an ontology graph that conforms to the ontology we defined.
After defining the ontology, we must decide how to structure it so that the event-matching mechanism can find semantic information efficiently. Multidimensional index structures have been widely used in application fields such as databases and image search to let users acquire information swiftly. The X-tree[21] is a multidimensional index structure that combines a class hierarchy with super-nodes (i.e. size-flexible class linear arrays). It is suitable for expressing the hierarchy of ontology relationships around each concept.
We therefore structure the ontology on the basis of the X-tree. Fig1(b) shows the ontology structure. First, a hash table maps each concept or property to the domains it belongs to; if the entry is a concept, the domain link then leads to its super-node. A super-node is an array of the form (core-concept, attribute, child, parent): core-concept represents all the synonyms in the domain (i.e. all synonyms map to the same super-node), attribute stores all the attributes the concept owns, child links to its subclasses and parent links to its super-classes. A property of a concept is an array of the form (property, owner), in which owner links to the owners of the property.
The index thus expresses the relationships we defined explicitly.
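A minimal sketch of this two-level lookup (hash table, then domain link, then super-node) might look as follows. The names `SuperNode`, `OntologyIndex` and their methods are assumptions made for illustration, and an ordinary Python dict stands in for the hash table; the sketch does not reproduce an actual X-tree.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class SuperNode:
    """(core-concept, attribute, child, parent) as described in the text."""
    core_concept: str
    synonyms: set = field(default_factory=set)   # every term mapped to this node
    attribute: set = field(default_factory=set)  # attributes owned by the concept
    child: list = field(default_factory=list)    # super-nodes of subclasses
    parent: list = field(default_factory=list)   # super-nodes of super-classes


class OntologyIndex:
    def __init__(self):
        self.table = {}           # term -> {domain: SuperNode}
        self.property_owner = {}  # property -> [owning SuperNode, ...]

    def add_concept(self, domain, core, synonyms=()):
        node = SuperNode(core, {core, *synonyms})
        for term in node.synonyms:  # all synonyms share one super-node
            self.table.setdefault(term, {})[domain] = node
        return node

    def add_subclass(self, parent, child):
        parent.child.append(child)
        child.parent.append(parent)

    def add_property(self, owner, prop):
        owner.attribute.add(prop)
        self.property_owner.setdefault(prop, []).append(owner)

    def lookup(self, term, domain):
        """Return the super-node of `term` in `domain`, or None if unknown."""
        return self.table.get(term, {}).get(domain)
```

Because the first level is keyed by term and the second by domain, a word such as "notebook" can resolve to different super-nodes in different domains.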
B. The Conversion of Subscriptions
The events and subscriptions in content-based Pub/Sub systems can be described as:
• Event = ∪ Attribute, Attribute = (name, type, value);
• Subscription = ∪ Filter, Filter = ∩ Constrain, Constrain = (name, type, op, value)
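The original name-value model can be sketched directly from these definitions: an event is a set of attributes, a filter is a conjunction of constraints, and a subscription is a union of filters. The function names below are hypothetical, and only the operators <, = and > are assumed.

```python
import operator
from typing import NamedTuple

OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


class Attribute(NamedTuple):
    name: str
    type: str
    value: object


class Constrain(NamedTuple):
    name: str
    type: str
    op: str
    value: object


def constrain_holds(constrain, event):
    """A constraint holds if some attribute with the same name and type satisfies it."""
    return any(
        a.name == constrain.name
        and a.type == constrain.type
        and OPS[constrain.op](a.value, constrain.value)
        for a in event
    )


def filter_matches(flt, event):
    """Filter = conjunction of its constraints."""
    return all(constrain_holds(c, event) for c in flt)


def subscription_matches(subscription, event):
    """Subscription = union of its filters."""
    return any(filter_matches(f, event) for f in subscription)
```

Matching here is purely syntactic: an event that says "PC" does not satisfy a constraint on "computer", which is exactly the heterogeneity the conversion in this section addresses.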
Each event agent stores all or part of the subscriptions, which originally contain no semantic information. We therefore use a mapping function to convert them, according to the ontology library, into a new model that carries semantics, and store the converted subscriptions.
In the original set of Constrain in a Filter, a name can represent a concept or a property in the library, and such names are usually semantically related to each other (e.g. (buygood, String, =, computer), (buyprice, Float, <, 10)). We therefore rebuild the data model around concepts, defining it as (concept, FilterC, property, FilterP, url), where FilterC is the constraint on the concept, property is the set of attributes of the concept, FilterP is the corresponding constraint on the properties, and url holds the domain of the concept.
When each Constrain in a Filter is converted, the comprehensiveness of the subscription must be preserved. Three cases are therefore distinguished, as follows:
• When Hash(name1) in Constrain1 (name1, type1, op1, value1) maps to the super-node of a concept, we replace name1 with the core-concept. If the Constrains already converted in the Filter contain no entry with the same core-concept and domain, we build a new (concept, FilterC, property, FilterP, url), put the core-concept into concept, and put (type1, op1, value1) and the domain of the concept into FilterC and url respectively. If such an entry exists, we simply add (type1, op1, value1) to its FilterC. In both cases value1 is mapped as well.
• When name1 is a property, we defer it until all concepts have been converted. We then find the concept that owns the property and put name1 and (type1, op1, value1) into the property and FilterP of that concept's entry.
• If name1 does not exist in the ontology library, we build a new entry as in the first case.
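The three cases above can be sketched as a single pass over a Filter followed by a deferred pass for properties. Everything here is a simplification under stated assumptions: `convert_filter` and its two lookup tables are hypothetical stand-ins for the ontology library, each concept is assumed to have already been resolved to one domain, and "mapping value1" is read as replacing a value by its core-concept when the value is a known term.

```python
def convert_filter(constrains, concepts, property_owner):
    """Convert name-value constraints into concept-based entries.

    constrains:     list of (name, type, op, value)
    concepts:       term -> (core_concept, domain)   (assumed lookup table)
    property_owner: property name -> core_concept    (assumed lookup table)
    """
    entries = {}   # (concept, url) -> converted entry
    deferred = []  # property constraints, handled after all concepts

    def entry(concept, url):
        return entries.setdefault(
            (concept, url),
            {"concept": concept, "FilterC": [], "property": [], "FilterP": [], "url": url},
        )

    for name, type_, op, value in constrains:
        if name in concepts:                 # case 1: name is a concept
            core, domain = concepts[name]
            mapped = concepts.get(value, (value, None))[0]
            entry(core, domain)["FilterC"].append((type_, op, mapped))
        elif name in property_owner:         # case 2: name is a property
            deferred.append((name, type_, op, value))
        else:                                # case 3: unknown to the ontology
            entry(name, None)["FilterC"].append((type_, op, value))

    for name, type_, op, value in deferred:
        owner = property_owner[name]
        target = next((e for e in entries.values() if e["concept"] == owner), None)
        if target is None:                   # owner not mentioned in this Filter
            target = entry(owner, None)
        target["property"].append(name)
        target["FilterP"].append((type_, op, value))

    return list(entries.values())
```

For example, constraints on "movie" and "price" collapse into one entry for the core-concept "film" when "movie" is a synonym of "film" and "price" is one of its properties.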
Figure1. Ontology Establishment
Figure2. Instance of FP Support Counting
One remaining problem is which domain to choose when a concept exists in several domains. In this situation the context must be considered. A concept has different attributes and different relationships with other concepts in different domains. Exploiting the association, or context, within one Filter of a subscription, we use support counting based on the FP tree[20]; Fig2 shows an instance. If a concept appears in a domain, the support count of that concept and domain increases, and the properties and subclasses of the concept increase the count as well. In the end, if a concept can map to more than one domain, we choose the domain with the maximum support.
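The idea of choosing a domain by support can be illustrated with plain counting. This sketch does not build an FP tree; it only tallies, per domain, how many terms of one Filter belong to it and how many of their properties or subclasses also appear in the same Filter. The function `choose_domains` and both lookup tables are hypothetical.

```python
from collections import Counter


def choose_domains(terms, domains_of, related):
    """Pick one domain per term using context support within a Filter.

    terms:      terms appearing in one Filter
    domains_of: term -> set of candidate domains
    related:    (term, domain) -> properties and subclasses of term in that domain
    """
    support = Counter()
    for term in terms:
        for domain in domains_of.get(term, ()):
            support[domain] += 1  # the term itself occurs in this domain
            # properties/subclasses of the term that co-occur in the Filter
            support[domain] += sum(
                1 for other in related.get((term, domain), ()) if other in terms
            )
    chosen = {}
    for term in terms:
        candidates = domains_of.get(term)
        if candidates:
            # sorted() makes tie-breaking deterministic
            chosen[term] = max(sorted(candidates), key=lambda d: support[d])
    return chosen
```

A Filter mentioning "notebook" together with "cpu" would then resolve "notebook" to a computer domain, while one mentioning "paper" would resolve it to a stationery domain.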
When an event arrives, the Attributes of the event are converted, i.e. {(namei, typei, valuei)} ⇒ {(concepti, AttrCi, propertyi, Attrpi, urli)}. The conversion procedure is the same as for subscriptions. The only difference is that we add the super-classes of each concept to the event expression, because a subscriber who is interested in a concept is also interested in its subclasses.
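The extra step for events, adding every super-class of a concept, amounts to walking the parent links upward. A minimal sketch, assuming a hypothetical `parent_of` table that maps each concept to its direct super-classes:

```python
def with_super_classes(concept, parent_of):
    """Return the concept followed by all of its super-classes."""
    result, stack = [concept], [concept]
    while stack:
        for parent in parent_of.get(stack.pop(), ()):
            if parent not in result:
                result.append(parent)
                stack.append(parent)
    return result
```

An event about "notebook" is thereby also expressed as an event about "computer", so a subscription on "computer" can match it.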
C. Event-Matching Approach
The efficiency of event matching is an important performance criterion. Here we improve it by using the coverage among subscriptions and the constraint conditions of the concept or property.
During conversion, the subscriptions have already been merged according to their semantic relationships. Even so, matching every Constrain when an event arrives would clearly be wasteful. The Constrains in the Filters of the subscriptions can be clustered by concept, so we only need to match the Constrains whose concept is relevant to a concept involved in the event.
In addition, matching a constraint condition means matching its (type, op, value). The constraint conditions can therefore also be clustered by type.
Meanwhile, the constraint conditions of a concept or property (i.e. FilterC or FilterP) have coverage relationships. Here we refer to FilterC and FilterP collectively as F. If every value that satisfies F2 also satisfies F1, we say that F1 covers F2 and write F2 ≺ F1 (e.g. (Float, <, 10) ≺ (Float, <, 11), since any value below 10 is also below 11). Because of coverage, different matching orders lead to different matching efficiency, as shown in [14].
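For constraints of the form (type, op, value) with the operators <, = and >, the coverage test can be written out explicitly. The sketch below assumes the reading that agrees with the example (Float, <, 10) ≺ (Float, <, 11): F1 covers F2 when every value satisfying F2 also satisfies F1. The function name `covers` is hypothetical.

```python
def covers(f1, f2):
    """True if every value that satisfies f2 also satisfies f1.

    Each constraint is a (type, op, value) triple with op in {"<", "=", ">"}.
    """
    t1, op1, v1 = f1
    t2, op2, v2 = f2
    if t1 != t2:
        return False
    if op2 == "=":
        # f2 admits the single value v2, so f1 only has to accept v2
        return {"<": v2 < v1, ">": v2 > v1, "=": v2 == v1}[op1]
    if op1 == op2 == "<":
        return v2 <= v1
    if op1 == op2 == ">":
        return v2 >= v1
    return False  # e.g. an equality or an opposite bound cannot cover a range
```

Under this definition `covers(("Float", "<", 11), ("Float", "<", 10))` holds, while the reverse does not.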
Assume that matching Fi takes time FCi and succeeds with probability FPi. The matching time of one Filter in a subscription is then

$$C_s = FC_1 + (1 - FP_1) \times FC_2 + \dots + \prod_{i=1}^{n-1} (1 - FP_i) \times FC_n \qquad (1)$$
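Equation (1) can be evaluated numerically to see how the order of the F affects the expected cost. The sketch below is a direct transcription of the formula; the function name and the sample costs and probabilities are made up for illustration.

```python
def expected_cost(costs, probs):
    """Evaluate equation (1): F_i is only evaluated if F_1..F_{i-1} all failed.

    costs: [FC_1, ..., FC_n]; probs: [FP_1, ..., FP_n]
    """
    total, reach = 0.0, 1.0  # reach = probability that F_i is evaluated at all
    for fc, fp in zip(costs, probs):
        total += reach * fc
        reach *= 1.0 - fp
    return total
```

With unit costs and success probabilities 0.9, 0.5 and 0.1, evaluating the most likely F first gives 1 + 0.1 + 0.05 = 1.15, whereas the reverse order gives 1 + 0.9 + 0.45 = 2.35, which is why the matching order matters.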
However, finding the optimal matching order is an NP problem, so here we consider only the coverage relationship between pairs of F and build a constraint cover tree to achieve a partial optimization; its time complexity is O(n log2 n).
When F1 covers F2, F1 links to F2. Once an F on a branch is matched successfully, the others on the branch are considered successful as well.
In this procedure, the matching speed depends on the extent of coverage. The worst case needs $\sum_{i=1}^{n} FC_i$, and the best case needs only one execution.
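One way to read the best and worst cases is to walk a single branch of the cover tree from its most specific constraint upward. The sketch below is an interpretation, not the authors' algorithm: it assumes that each node's parent covers it, so that as soon as one constraint succeeds, every constraint above it on the branch is marked successful without being evaluated. The names `CoverNode` and `match_branch` are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Optional

OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


@dataclass(eq=False)
class CoverNode:
    op: str
    bound: float
    parent: Optional["CoverNode"] = None  # the constraint that covers this one

    def holds(self, value):
        return OPS[self.op](value, self.bound)


def match_branch(leaf, value):
    """Return (satisfied nodes, number of evaluations) for one branch."""
    evaluations = 0
    node = leaf
    while node is not None:
        evaluations += 1
        if node.holds(value):
            satisfied = []
            while node is not None:  # everything that covers a hit is also a hit
                satisfied.append(node)
                node = node.parent
            return satisfied, evaluations
        node = node.parent
    return [], evaluations
```

On a branch (<, 5) covered by (<, 10) covered by (<, 11), the value 3 needs one evaluation and satisfies all three constraints, while the value 20 needs all three evaluations and satisfies none.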
When an event arrives, the event agent converts the event model and matches it using the index structure shown in Fig3.
Step1: Convert the Attributes of event E from (name, type, value) to (concept, AttrC, property, Attrp, url). For each Attribute, take its concept, use the index structure above to locate the same concept, and match url and type to obtain the constraint cover tree. Search the tree depth-first. If AttrC successfully matches the FilterC of a Constrain in the Filter of some subscription, the count of that Filter increases by 1, and so do the counts of the remaining Filters on the branch.
Step2: Check the count of each Filter. If the count of a Filter equals the total number of FilterC in the Constrains it includes, or equals the number of AttrC in the Attributes of the event, the concept part of the Filter matches the event. The unmatched Filters are eliminated.
Step3: Take the properties of each Attribute in event E, match them against the remaining subscriptions as in Step1, and count.
Step4: Check the count of each Filter. If it equals the number of Attrp over all properties in event E, or the number of FilterP over all properties in the Filter, the subscription that contains the Filter matches the event.
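The counting in Step1 and Step2 can be sketched in a few lines. This is a deliberately reduced version: constraints are clustered by concept only (type, url and the cover tree are left out), a Filter matches when its count equals the number of its constraints, and the property phase of Step3 and Step4 would repeat the same procedure. The function name and data layout are assumptions.

```python
import operator
from collections import defaultdict

OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def match_concepts(event, filters):
    """Counting-based matching clustered by concept.

    event:   {concept: value}
    filters: {filter_id: [(concept, op, value), ...]}
    Returns the ids of filters whose constraints are all satisfied.
    """
    index = defaultdict(list)  # concept -> [(filter_id, op, value)]
    for fid, constrains in filters.items():
        for concept, op, value in constrains:
            index[concept].append((fid, op, value))

    counts = defaultdict(int)
    for concept, attr_value in event.items():
        # only constraints on a concept present in the event are examined
        for fid, op, value in index.get(concept, ()):
            if OPS[op](attr_value, value):
                counts[fid] += 1

    return [fid for fid, constrains in filters.items() if counts[fid] == len(constrains)]
```

Constraints on concepts that the event does not mention are never evaluated, which is the saving that clustering by concept provides.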
IV. Experiments
A. Implement and Methodology
To study whether the event matching supports semantics and how effective it is, we implemented two approaches in a prototype Pub/Sub system: our ontology-based approach and the traditional counting algorithm without ontology support. We evaluate the approach from two sides: the correction rate (i.e. the rate of correct matches) of event matching and the average matching time.
The experiments run on a machine with 1 GB of DDR2 memory and a 1.8 GHz CPU under Windows Vista, and the approach is implemented with JDK 1.5.
In the experiment, we create an ontology library with 3 relationships (rdfs:subClassOf, owl:equivalent and owl:attributeOf), 20 concepts and 15 attributes. The semantic information about concepts and attributes in the library is increased step by step in order to observe its influence on the correction rate of event matching. We also observe the average matching time as the number of subscriptions increases, in order to evaluate the efficiency of the approach. The default parameters are shown in Table 1.
Figure3. The Structure of Event Matching Approach
TABLE 1. EXPERIMENT DEFAULT PARAMETERS
Parameter          Default value      Description
Ontology           (35,3)             35 concepts and attributes, 3 relationships
Numsub             (1000,15000)       Number of subscriptions
Numsem_info        (0,35)             Semantic information volume
op                 <,=,>              Op in original data model
type               String,Float,int   Type in original data model
Numsub_constrain   50                 Number of constraint conditions per subscription
Numevent_attr      50                 Number of attributes per event
B. Experiment Result
[Figure 4(a) plot: correction rate (%), 0 to 100, versus semantic information volume, 0 to 35; series: semantic, traditional counting]
[Figure 4(b) plot: matching time (ms), 0 to 250, versus number of subscriptions, 1000 to 15000; series: semantic, traditional counting]
Fig4 (a) shows how the correction rate of event matching changes as the semantic information volume increases (e.g. as concept or property information is added). From the result we conclude that the completeness of the ontology is a significant factor influencing the correction rate. Without ontology information, the correction rates of the traditional counting algorithm and of our approach are nearly the same. As semantic relationships are added, especially synonym relationships, the correction rate tends to rise. During this increase, however, a reverse tendency appears when more domain information is added: when little context information is available, it is hard to map a concept to the correct domain, which produces some misinterpretations, so the correction rate falls slightly. Once the semantic relationships are complete, the correction rate stabilizes at 62% in this case. Our approach can therefore deal with semantic heterogeneity to some extent, especially for synonyms, but its support for context and attribute relationships is only ordinary.
Fig4 (b) shows the time taken by event matching, at a fixed semantic information volume, as the number of subscriptions increases from 1000 to 15000, comparing our approach with the traditional counting algorithm. The time spent by our approach, which excludes the time for converting subscriptions and for communication, tends to increase, and the increase grows only slightly as the number of subscriptions rises. At the beginning, the time is spent mainly on converting the event model, about 4 ms for 50 Attributes. Overall, when the number of subscriptions becomes large, our approach takes less time than the traditional counting algorithm. However, using a multidimensional index costs more space.
V. Conclusions
This paper introduced ontology on the basis of existing content-based Pub/Sub systems. We built a high-level ontology library, mapped the original data model to a new concept-supporting one by applying a multidimensional index (X-tree), and proposed an event-matching approach that uses the coverage among subscriptions and the matching order in a tree structure to enhance efficiency. The experiment shows that our approach can deal with semantic heterogeneity to some extent and has good matching efficiency. In future work, as semantic web technology improves, we hope that our ontology module can support more semantic relationships and solve the format issue in semantic heterogeneity, and that the matching procedure can be simplified further to provide better matching efficiency.
References
[1] Y. Tock, N. Naaman, A. Harpaz, and G. Gershinsky. Hierarchical clustering of message flows in a multicast data dissemination system. In 17th IASTED Int'l Conf. Parallel and Distributed, 2005.
[2] H. Liu, V. Ramasubramanian, and E. G. Sirer. Client behavior and feed characteristics of RSS, a publish-subscribe system for web micronews. In IMC, 2005.
[3] Carzaniga A, Rosenblum DS, Wolf AL. Design and evaluation of a wide-area event notification service. ACM Trans. on Computer Systems, 2001.
[4] Segall B, Arnold D. Content based routing with Elvin4. In: Proc. of the Australian UNIX and Open Systems User Group Conference (AUUG2K). Canberra, Australia, Jun 2000.
[5] IBM Corporation. Gryphon: Publish/subscribe over public networks. Technical report, IBM T. J. Watson Research Center, 2001. http://www.research.ibm.com/gryphon/papers/Gryphon-Overview.pdf
[6] IBM RedBook. Internet Application Development with MQSeries and Java. February 1997. http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg244896.html
Figure4. Experiment result: (a) Correction rate in different semantic information volume; (b) Matching time in different number of subscription
[7] Rowstron A, Kermarrec AM. SCRIBE: The design of a large-scale event notification infrastructure. Proc. of the 3rd Int'l Workshop on Networked Group Communication. 2001.
[8] Antonio Carzaniga, David S. Rosenblum, and Alexander L. Wolf. Design and Evaluation of a Wide-Area Event Notification Service. ACM Transactions on Computer Systems, 19(3):332-383, August 2001.
[9] Sasu Tarkoma, Jaakko Kangasharju. Filter Merging for Efficient Information Dissemination. In 13th International Conference on Cooperative Information Systems, October 2005.
[10] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, et al. Matching Events in a Content-Based Subscription System. In Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 53-61, 1999.
[11] K. J. Gough and G. Smith. Efficient recognition of events in distributed systems. In Proceedings of the 18th Australasian Computer Science Conference (ACSC18), 1995.
[12] A. Campailla, S. Chaki, E. Clarke, S. Jha, and H. Veith. Efficient Filtering in Publish-Subscribe Systems using Binary Decision Diagrams. In Proceedings of the 23rd International Conference on Software Engineering, pp. 443-452, May 2001.
[13] J. Pereira, F. Fabret, F. Llirbat, and D. Shasha. Efficient matching for web-based publish/subscribe systems. In CoopIS, 2001.
[14] Xiangfeng Guo, Jun Wei. Efficient Event Matching in Publish/subscribe: Based on Routing Destination and Matching History. International Conference on Networking, 2008.
[15] Gruber T R. A translation approach to portable ontologies[J]. Knowledge Acquisition, 5(2): 199-220, 1993.
[16] Jinlin Wang. The research of key technologies in Internet-oriented publish/subscribe system, 2005.
[17] DongCai Shi. The research of key technologies in P2P-Network based semantic publish/subscribe system, Zhejiang University, 2007.
[18] M. Altinel, M. J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information. In Proceedings of the 26th International Conference on Very Large Data Bases, 2000.
[19] Lu Jianjiang, Zhang Yafei. Semantic Web principle and technology. Science Press, Shanghai, 2007, p66-8.
[20] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 181-192, Seattle, WA, July 1994.
[21] Berchtold S, Keim D, Kriegel H-P. The X-tree: an index structure for high-dimensional data. Proceedings of the 22nd International Conference on Very Large Databases, 1996: 28-39.
[22] Wang Jinling, Jin Beihong. Data Model and Matching Algorithm in an Ontology-Based Publish/Subscribe System. Journal of Software, Vol.16, No.9: p1625-1636, 2005.