
Source: cs.stmarys.ca/~pawan/icdm05/proceedings/MADW-Proceedings.pdf

Multiagent Data Warehousing (MADW) and Multiagent Data Mining (MADM)

Proceedings of a Workshop held in Conjunction with 2005 IEEE International Conference on Data Mining

Houston, USA, November 27, 2005

Edited by

Wen-Ran Zhang, Georgia Southern University, USA
Yanqing Zhang, Georgia State University, USA
Xiaohua (Tony) Hu, Drexel University, USA

ISBN 0-9738918-0-7


The papers appearing in this book reflect the authors’ opinions and are published in the interests of timely dissemination based on review by the program committee or volume editors. Their inclusion in this publication does not necessarily constitute endorsement by the editors. ©2005 by the authors and editors of this book. No part of this work can be reproduced without permission except as indicated by the “Fair Use” clause of the copyright law. Passages, images, or ideas taken from this work must be properly credited in any written or published materials. ISBN 0-9738918-0-7 Printed by Saint Mary’s University, Canada.


CONTENTS

Workshop Committee …… ii

Foreword …… iii

A Multiagent Framework to Integrate and Visualize Gene Expression Information
Li Jin, Karl V. Steiner, Carl J. Schmidt, Gang Situ, Sachin Kamboj, Kay T. Hlaing, Morgan Conner, Heebal Kim, Marlene Emara, and Keith S. Decker (University of Delaware, USA) …… 1

Concepts, Challenges, and Prospects on Multiagent Data Warehousing and Multiagent Data Mining
Wen-Ran Zhang (Dept. of Computer Science, College of Information Technology, Georgia Southern University, USA) …… 8

Multi-Party Sequential Pattern Mining Over Private Data
Justin Zhan, LiWu Chang, and Stan Matwin (School of Information Technology & Engineering, University of Ottawa, Canada; Center for High Assurance Computer Systems, Naval Research Laboratory, USA) …… 18

Privacy-Preserving Decision Tree Classification Over Vertically Partitioned Data
Justin Zhan, Stan Matwin, and LiWu Chang (School of Information Technology & Engineering, University of Ottawa, Canada; Center for High Assurance Computer Systems, Naval Research Laboratory, USA) …… 27

Data Mining for Adaptive Web Cache Maintenance
Sujaa Rani Mohan, E. K. Park, and Yijie Han (University of Missouri, Kansas City, USA) …… 36

Temporal Intelligence for Multiagent Data Mining in Wireless Sensor Networks
Sungrae Cho, Ardian Greca, Yuming Li, and Wen-Ran Zhang (Department of Computer Science, Georgia Southern University, USA) …… 44

A Schema of Multiagent Negative Data Mining
Fuhua Jiang, Yan-Qing Zhang, and A. P. Preethy (Dept. of Computer Science, Georgia State University, USA) …… 49

Distributed Multi-Agent Knowledge Space (DMAKS): A Knowledge Framework Based on MADWH
Adrian Gardiner (Dept. of Information Systems, Georgia Southern University, USA) …… 53

Applying MultiAgent Technology for Distributed Geospatial Information Services
Naijun Zhou (Department of Geography, University of Maryland - College Park, USA) and Lixin Li (Department of Computer Sciences, Georgia Southern University, USA) …… 62


Workshop Committee

Website: http://tinman.cs.gsu.edu/~cscyntx/ICDM-MADW-MADM2005.htm

Honorary Chair M. N. Huhns, University of South Carolina, USA

Workshop Organizers and Program Committee Co-Chairs

Wen-Ran Zhang Georgia Southern University, USA

[email protected]

Yan-Qing Zhang Georgia State University, USA

[email protected]

Xiaohua Tony Hu Drexel University, USA

[email protected]

Publicity Chair Yuchun Tang, Georgia State University, USA, [email protected]

Program Committee

Ajith Abraham (Chung-Ang University, Korea)
Nick Cercone (Dalhousie University, Canada)
Sungrae Cho (Georgia Southern University, USA)
Diane J. Cook (University of Texas at Arlington, USA)
Dejing Dou (University of Oregon, USA)
Xiaohua Tony Hu (Drexel University, USA)
Mark Last (Ben-Gurion University of the Negev, Israel)
Vincenzo Loia (Universita di Salerno, Italy)
Yi Pan (Georgia State University, USA)
Ziyong Pen (Wuhan University, China)
Zhongzhi Shi (Chinese Academy of Sciences, China)
Il-Yeol Song (Drexel University, USA)
Raj Sunderraman (Georgia State University, USA)
Yong Tang (Zhongshan University, China)
David Taniar (Monash University, Australia)
Juan Vargas (University of South Carolina, USA)
Feiyue Wang (University of Arizona, USA)
Xindong Wu (University of Vermont, USA)
John Yen (Pennsylvania State University, USA)
Hao Ying (Wayne State University, USA)
Wen-Ran Zhang (Georgia Southern University, USA)
Yan-Qing Zhang (Georgia State University, USA)
Yi Zhang (Univ. of Electronic Science and Technology of China)
Ning Zhong (Maebashi Institute of Technology, Japan)


Foreword

The 2005 IEEE-ICDM Workshop on Multiagent Data Warehousing and Multiagent Data Mining (MADW/MADM-2005) is the first international workshop of its kind. It brings together researchers from diverse areas, including data mining, data warehousing, multiagent systems, artificial intelligence, computational intelligence, machine learning, robot control, knowledge management, bioinformatics, neuroscience, and other related areas, to lay out the foundation for MADW/MADM.

Biological systems such as brains have enormous capabilities in information processing and coordinated knowledge discovery. One challenging issue facing data mining and knowledge discovery today is understanding how the enormous amount of radio, audio, spatio-temporal, and bio-information is processed by the massive number of neural or genetic agents of biological systems, and how multiple agents can be coordinated for information processing and knowledge discovery at the micro and/or macro system levels. MADWH/MADM can be considered a YinYang pair, where the Yin is internal centralization that promotes coordinated computational intelligence (CCI) and the Yang is external distribution that promotes distributed artificial intelligence (DAI). The two sides coexist and reinforce each other in knowledge discovery. Many disciplines, including computational intelligence and artificial intelligence, can join forces on the common platform of MADWH/MADM.

Technical issues include (but are not limited to):

- Necessity, applicability, and feasibility analysis for MADW/MADM in different domains;
- Coordinated computational intelligence (CCI) and distributed artificial intelligence (DAI);
- Dimension analysis, algorithms, and methods for MADM/MADW;
- Multiagent data mining (MADM) vs. multirelational data mining (MRDM);
- Agent cuboids, schemas, and architectures of MADW;
- Query languages for OLAP and OLAM with MADW and MADM;
- Mining agent association rules in first-order logic;
- Coordination protocols for collaborative knowledge discovery with MADW/MADM;
- Agent discovery, law discovery, self-organization, and reorganization;
- Full autonomy as a result of coordination of semiautonomous agents;
- Reinforced knowledge discovery with the interplay of MADW/MADM;
- Agent similarity and orthogonal MADW;
- MADW/MADM for brain modeling;
- MADW/MADM for applications in security/privacy, bioinformatics, biomedicine, the semantic Web, e-business, Web services, Web mining, grids, wireless networks, mobile networks, ad hoc networks, sensor networks, flexible engineering, robot learning/control, knowledge management, geographical information systems, and other suitable domains.

In these proceedings we include nine papers in the following categories: Concepts-Challenges-Prospects; Application in Bioinformatics; Privacy/Security in Web-Based MADM; Adaptive Web-Cache for MADM; Temporal Intelligence for MADM in Wireless Sensor Networks; MADM for Machine Learning; MADM from Geographical Databases; and MADW/MADM for Knowledge Management.

This workshop marks the birth of a new research area. The short-term potential of this emerging area lies in multidimensional, reorganizable, agent-oriented OLAP and OLAM in business, engineering, and biomedical applications; its long-term impact can be far-reaching in scientific discoveries, especially in knowledge discovery about bio-agents, agent associations, agent organizations, and natural laws.

Our thanks go to the authors, program committee members, Publicity Chair Yuchun Tang, Honorary Chair Michael N. Huhns, and IEEE ICDM05 Workshops Chair Pawan Lingras for their contributions, services, and support.

Co-Organizers: Wen-Ran Zhang Yanqing Zhang Xiaohua (Tony) Hu


A Multiagent Framework to Integrate and Visualize Gene Expression Information

Li Jin 1,2,*, Karl V. Steiner 2,3,*, Carl J. Schmidt 4, Gang Situ 1, Sachin Kamboj 1, Kay T. Hlaing 1, Morgan Conner 1, Heebal Kim 4, Marlene Emara 4, and Keith S. Decker 1,*

1 Department of Computer and Information Sciences, 2 Delaware Biotechnology Institute, 3 Department of Electrical and Computer Engineering, 4 Department of Animal and Food Sciences, University of Delaware, Newark, DE 19711

* Email: {jin, decker}@cis.udel.edu, [email protected]

Abstract

With rapidly growing amounts of genomic and expression data publicly available, efficient and automated analysis tools are increasingly important for biologists to derive knowledge for a large variety of organisms. Multiagent information gathering methods can be used to retrieve and integrate genomic and expression information from available databases or web sources to generate knowledge databases for organisms of interest. We present a novel, flexible and generalizable bioinformatics multiagent system, called BioMAS, which in this paper is used to gather and annotate genomic data from various databases and web sources and to analyze the expression of gene products for a given organism. We also present a new approach to visualizing complex datasets representing gene expression and pathway models in a hierarchical view space using the Starlight information visualization system (Starlight). This approach is an innovative application of Starlight in the field of comparative genomics.

1. Introduction

1.1. Overview

Since the successful completion of the Human Genome Project in 2003 [1], not only genomic data but also proteomic and expression data of numerous species have been published. Homologs, homologous genes that have common origins and share an arbitrary threshold level of similarity determined by alignment of matching bases [2], play an important role in predicting gene products by using the annotations in public databases. Therefore, sequences of different organisms are compared for clues about gene function.

Public databases, such as KEGG [3], GenBank [4], SwissProt [5], and Ensembl [6], provide a huge data resource for genomic information, metabolic pathways and gene expression that can be utilized to study sequences, gene expression, or pathways of closely related species by sequence and functional annotations [7]. However, these public databases are heterogeneous and constantly updated, and new data sources are constantly published on-line. Therefore, a multiagent information gathering system that satisfies the following basic requirements should be helpful to advance biological research.

(1) It can retrieve and integrate sequence and function information from distributed, heterogeneous and dynamically updated databases.

(2) New agents for new data sources or new analysis services can be added easily in the future.

(3) It should be easily adjustable for a variety of different organisms.

In this paper, we present a general bioinformatics multiagent system, called BioMAS, which meets these three basic requirements.

Biologists are familiar with several visual representations of metabolic pathways. Visualizations may aid in understanding the complex relationships between pathway components, extracting important information, and comparing pathways between different organisms. Our goal is not only to retrieve and integrate gene expression information from public data sources into our knowledge databases but also to load all the gene expression information into an information visualization system such that relationships between the datasets can be visualized easily. Our approach is to utilize the Starlight information visualization system [8] to visually analyze organism-specific data within a three-dimensional hierarchical view with existing pathway diagrams and chromosome diagrams. Starlight was initially created as a visualization system for the military intelligence community, with features such as query tools for data and data mining and reference-tagged images and diagrams [9]. This paper describes an innovative application of this program to the field of bioinformatics.

2005 IEEE ICDM Workshop on MADW & MADM 1

In the rest of this paper, we will first review existing approaches for retrieving, integrating, and visualizing biological data. Then we will discuss the details of the processing of the gene expression component in BioMAS, because most of the other components of BioMAS have previously been described in [10]. Next, the visualization of the gene expression data utilizing Starlight will be presented. Knowledge about the organism generated from different sources is combined into one visual space to make it convenient for biologists to compare genomic data across different organisms. As a demonstration, we will present the visualization of the gene expression comparison of two different organisms, chicken (Gallus gallus) and human (Homo sapiens). Finally, we will discuss the advantages and limitations of our methods and future work.

1.2. Related work

BioMAS is a bioinformatics multiagent system with an increasing number of functions and dedicated agents. The work presented in this paper is a new gene expression processing organization in BioMAS; previous work included (1) a basic annotation and query agent organization, (2) a functional annotation agent organization, and (3) an EST (Expressed Sequence Tags) processing agent organization [10]. There are several systems available that are aimed at retrieving, integrating, annotating and visualizing biological data. Our system differs in that it can respond to dynamic changes in data sources, which systems such as TSIMMIS [11] or InfoSleuth [12] cannot. New sources and analysis methods can be easily integrated into our system, whereas GeneWeaver [13] is not based on a shared architecture that supports reasoning about secondary user utility [10]. Our approach is not only to retrieve and integrate related gene expression information into databases and publish the information on-line but also to provide a novel method to visualize the relationships among the information utilizing Starlight. There have been several related studies that visualize gene expression data on metabolic networks, such as KEGG, ExPASy [14], and EcoCyc [15]. The type of pathway visualization these web sites provide is static and predefined, i.e., users can obtain the information only by following the hyperlinks embedded in search results. For dynamic visualization, considerable research effort has been focused on implementing new visualization tools that redraw pathway networks and interactions among proteins automatically. Typical applications are the BioMaze project [16] and the VisANT visualization tool [17]. Although virtual reality tools also provide a way to understand metabolic networks and gene expression [18], visually representing the relationships among multidimensional information is a complex task. A critical advantage of Starlight is that it can combine data and images such as pathway diagrams into one visual space, enabling users to see the position of interesting data on the embedded images.

2. System and methods

The architecture overview of our system is illustrated in Figure 1, using the Gallus Knowledge Base (GallusKB) as an example. The Gallus information is retrieved from distributed sources and integrated into GallusKB using BioMAS. Then the gene expression and pathway data are parsed into XML files as required by Starlight. Finally, user queries can be processed interactively by Starlight and results are presented to the user in a hierarchical structure. This enables users to explore the result space by following associated relationships. The gene product pathways and chromosome positions can be displayed in both the pathway diagram and the chromosome diagram in the Starlight view space.

2.1. Information retrieval and integration

BioMAS is composed of five groups of agent organizations: (1) Sequence Annotation Agents, which integrate gene sequence annotations from various sources, such as NCBI databases, Protein Domains [19], PSort [20], and SwissProt [5]; (2) EST Processing Agents, which are responsible for building chicken contigs (contiguous sequences) and saving them into databases; (3) Functional Annotation Agents, which annotate the function of a gene using Gene Ontology [21] and MeSH (Medical Subject Headings) [22] terms; (4) Query Agents, which facilitate queries from users; and (5) Pathway Agents, which download human pathway information from KEGG, predict pathways for a given organism, and predict organs where gene products are expressed.
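The grouping above can be pictured as a small routing table from task types to the agent organization responsible for them. The sketch below is purely illustrative, not the actual BioMAS implementation; all class, task, and organization-method names are hypothetical.

```python
# Illustrative sketch (not the actual BioMAS code): a registry that routes
# a task type to the agent organization responsible for it.
class AgentOrganization:
    def __init__(self, name, tasks):
        self.name = name
        self.tasks = set(tasks)  # task types this organization handles

    def handle(self, task, payload):
        # A real organization would dispatch to its member agents here.
        return f"{self.name} handling {task} for {payload}"

ORGANIZATIONS = [
    AgentOrganization("Sequence Annotation Agents", {"annotate_sequence"}),
    AgentOrganization("EST Processing Agents", {"build_contigs"}),
    AgentOrganization("Functional Annotation Agents", {"annotate_function"}),
    AgentOrganization("Query Agents", {"user_query"}),
    AgentOrganization("Pathway Agents", {"download_pathways", "predict_pathways"}),
]

def route(task, payload):
    """Find the organization whose task set covers the requested task."""
    for org in ORGANIZATIONS:
        if task in org.tasks:
            return org.handle(task, payload)
    raise ValueError(f"no organization handles task {task!r}")

print(route("predict_pathways", "hsa00010"))
```

The point of the sketch is only the division of labor: each of the five organizations owns a disjoint set of task types, and a coordinator routes work accordingly.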


Figure 1. Architecture of knowledge information gathering and visualization.

The organizations outlined in (1)-(4) were described previously in [10, 23]. Therefore, this paper only focuses on Pathway Agents.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides a web-accessible database of human pathways, which can also be used to predict chicken pathways by blasting human gene sequences against the chicken contig database. As shown in Figure 1, the Pathway Agents of BioMAS are used to retrieve the KEGG pathway data for human genes, place them in the Pathway Database (Pathway DB), and save all human pathway information in a human pathway table. The Chicken Contigs Database (Chicken Contigs DB) holds 30,214 chicken contigs. BlastX [24] is used to blast each retrieved human gene sequence against the Chicken Contigs DB with an E-value cutoff of 1x10^-5 to identify the chicken gene products involved in the individual pathways as annotated by KEGG. All identified chicken gene products are saved into a chicken pathway table under Pathway DB. In the next step, pathway maps with highlighted chicken homologs are generated by the Pathway Image Processing Agent. As an example, http://udgenome.ags.udel.edu/gallus/pathway/hsa00010.php shows the pathway map of Glycolysis / Gluconeogenesis of Homo sapiens, where the highlights indicate matches between chicken homologs and human genes. The human genes and chicken contigs are mapped to chromosomal locations using the genomic sequence as determined by Washington University [25] and annotated using Ensembl [6]. In this way, the relationships between the chromosome positions of human and chicken genes involved in the same pathway can be compared visually. A Java parser is used to transform the data in Pathway DB to XML format, which serves as the input for Starlight. Users can then use Starlight to conduct data mining and data analysis.

2.2. Information visualization

A significant effort has been focused on dynamically generating biochemical network diagrams. The manually generated KEGG static diagram is a popular resource commonly used by biologists. However, to efficiently explore genomic databases such as Pathway DB, a sophisticated tool is needed to select, search, navigate and analyze the database visually, especially for the comparison of metabolic pathways in different organisms. With the GallusKB, our intention is to analyze the similarities and differences in metabolic pathways and gene expression between human and chicken. In addition, the corresponding chromosome positions of human genes and their chicken homologs are of interest. For this paper, we use the hsa00010 pathway for Glycolysis / Gluconeogenesis in Homo sapiens provided by KEGG as an example. Our approach is described in detail below for the following aspects: the data model defining the input data, the view mode with the hierarchical view, navigation to query the database in 3D visualization space, and the data presentation.

[Figure 1 shows the data flow from the distributed databases and on-line sources (KEGG DB, NCBI DB, SWISSPROT, PSort, Protein Domains, Gene Ontology, Ensembl DB, and MeSH Terms) through the BioMAS agent organizations (Sequence Annotation, EST Processing, Functional Annotation, Pathway, and Information Extraction Agents), BlastX, the Chicken Contigs DB, and the Pathway Image Processing Agent into GallusKB and Pathway DB (Human Pathway, Chicken Pathway, and Pathway Images tables), and via a Java parser to the Starlight Visualization System and the user.]

2.2.1. Data model. The input data format is a flat XML file, as outlined in the following example record in Figure 2, extracted from GallusKB.

<RECORD>
  <Organ>Kidney</Organ>
  <pathway_id>hsa00010</pathway_id>
  <pathway_definition>Glycolysis / Gluconeogenesis - Homo sapiens</pathway_definition>
  <ratio>61/61</ratio>
  <pathway_figure>hsa00010.gif</pathway_figure>
  <enzyme_EC>ec:1.1.1.1</enzyme_EC>
  <hsa_gene_id>hsa:124</hsa_gene_id>
  <hsa_gene_definition>alcohol dehydrogenase 1A (class I), alpha polypeptide</hsa_gene_definition>
  <hsa_gene_aaseq>
    >hsa:124 ADH1A; alcohol dehydrogenase 1A (class I), alpha polypeptide [EC:1.1.1.1] (A)
    MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAVGICGTDDHVVSGTMVT
    PLPVILGHEAAGIVESVGEGVTTVKPGDKVIPLAIPQCGKCRICKNPESNYCLKNDVSNP
    QGTLQDGTSRFTCRRKPIHHFLGISTFSQYTVVDENAVAKIDAASPLEKVCLIGCGFSTG
    YGSAVNVAKVTPGSTCAVFGLGGVGLSAIMGCKAAGAARIIAVDINKDKFAKAKELGATE
    CINPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTMMASLLCCHEACGTSVIVGVPPDSQ
    NLSMNPMLLLTGRTWKGAILGGFKSKECVPKLVADFMAKKFSLDALITHVLPFEKINEGF
    DLLHSGKSIRTILMF
  </hsa_gene_aaseq>
  <human_chromosome_position>4q21-q23</human_chromosome_position>
  <CHK_Number>CHK124</CHK_Number>
  <Gene_name>ADH1. ADH1</Gene_name>
  <chick_chromosome_position>Chr4</chick_chromosome_position>
  <chick_contig>UD.GG.Contig26702</chick_contig>
  <score>228</score>
  <e_value>7e-60</e_value>
  <link>http://udgenome.ags.udel.edu/gallus/displayReport.php?type=rep&name=GP&id=UD.GG.Contig26702</link>
  <pathway_enzyme_gene_chickcontig>Organ,Heart,hsa00010,ec:1.1.1.1,hsa:124,UD.GG.Contig26702</pathway_enzyme_gene_chickcontig>
</RECORD>

Figure 2. Flat XML format data of one chicken contig record.
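A record of this shape can be read with any standard XML library. The sketch below uses Python's standard library and the element names shown in Figure 2; the embedded record is abbreviated for illustration, and the threshold mirrors the BlastX E-value cutoff of 1x10^-5 from Section 2.1.

```python
# Minimal sketch: parsing one flat <RECORD> of the kind shown in Figure 2.
# The record is abbreviated; element names follow Figure 2.
import xml.etree.ElementTree as ET

record_xml = """<RECORD>
  <Organ>Kidney</Organ>
  <pathway_id>hsa00010</pathway_id>
  <hsa_gene_id>hsa:124</hsa_gene_id>
  <chick_contig>UD.GG.Contig26702</chick_contig>
  <e_value>7e-60</e_value>
</RECORD>"""

rec = ET.fromstring(record_xml)
e_value = float(rec.findtext("e_value"))

# Keep only confident homolog matches, as in the pathway prediction step
# (E-value cutoff of 1e-5 from Section 2.1).
if e_value <= 1e-5:
    print(rec.findtext("Organ"), rec.findtext("pathway_id"),
          rec.findtext("hsa_gene_id"), "->", rec.findtext("chick_contig"))
```

Note that the raw `&` characters in the `<link>` field of Figure 2 would have to be escaped as `&amp;` before a strict XML parser accepts the full record.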

2.2.2. Hierarchical view mode. Among the various view modes available in Starlight, the hierarchical view mode is selected for visual representation to explore the following research interests:

• Gene expression in several organs for chick contigs;

• Relationships between human genes, chicken homolog contigs and their respective chromosome positions;

• Comparison of human genes and chicken homolog contigs in KEGG pathways;

• Visually supported detection of chicken homologs for human gene products.

Figure 3. Hierarchical data structure.

Figure 3 shows the hierarchical data structure defined for the visualization of the data sets. With this hierarchical structure, we can study the relationships between organs, pathways, human genes and chicken contigs. We can also visually compare human and chicken gene expression with pathway diagrams and chromosome maps.
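The hierarchy of Figure 3 can be sketched as a nested mapping built from flat records, with organs at the top and chicken contigs at the leaves (the level order follows the navigation path of Section 2.2.3). Field names follow Figure 2; the sample records are invented for illustration.

```python
# Sketch: grouping flat records (field names as in Figure 2) into the
# organ -> pathway -> enzyme -> human gene -> chicken contig hierarchy
# of Figure 3. The sample records are invented for illustration.
from collections import defaultdict

def tree():
    # An arbitrarily deep autovivifying dictionary.
    return defaultdict(tree)

records = [
    {"Organ": "Kidney", "pathway_id": "hsa00010", "enzyme_EC": "ec:1.1.1.1",
     "hsa_gene_id": "hsa:124", "chick_contig": "UD.GG.Contig26702"},
    {"Organ": "Heart", "pathway_id": "hsa00010", "enzyme_EC": "ec:1.1.1.1",
     "hsa_gene_id": "hsa:124", "chick_contig": "UD.GG.Contig26702"},
]

hierarchy = tree()
for r in records:
    node = hierarchy[r["Organ"]][r["pathway_id"]][r["enzyme_EC"]]
    # Leaves are lists of chicken contigs matched to the human gene.
    node.setdefault(r["hsa_gene_id"], []).append(r["chick_contig"])

print(sorted(hierarchy))  # organs at the top level
print(hierarchy["Kidney"]["hsa00010"]["ec:1.1.1.1"]["hsa:124"])
```

Queries like "everything expressed in the kidney" then amount to walking one subtree, which is essentially what the hierarchical link view in Starlight presents visually.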

2.2.3. Navigation. With the hierarchical view mode, the dataset is displayed in 3-D view space as shown in Figure 4. As an example, this figure presents a hierarchical view of pathway hsa00010 - Glycolysis / Gluconeogenesis - Homo sapiens.

Figure 4. Hierarchical view of pathway hsa00010 - Glycolysis / Gluconeogenesis - Homo sapiens.


The navigation through organs, pathways, enzymes, genes, and chicken contigs can be conducted by following the hierarchical tree, as shown in Figure 5. Interactive navigation is easily possible along the path of any organ. Figure 5 shows the result of zooming into the dataset for the heart, hsa00010, ec:1.1.1.1, hsa:124 and CHK:124. By selecting any record, the details of this record can be viewed. This hierarchical view shows how certain genes are expressed within a specific organ, and how the respective chicken homologs are expressed.

The details for each specific record can be called up, as shown for the example in Figure 6, which is based on the UD.GG.Contig26702 of hsa:124. Each of these details can be exported as an HTML page and published on a designated web site via XSLT.

Figure 5. Close-up view of heart data in hierarchical view.

The hierarchical view mode makes it possible to visually query the gene data and compare gene expression in both human and chicken. The field query generator can be used to create a query, and query results can be presented in the link view of the hierarchical level. In this hierarchical view, any relationship query among data at different hierarchy levels can be conducted by selecting the data bar of interest at each level to highlight the resulting links as yellow lines. The corresponding results of the query can be displayed in a 3D environment. Figure 7 presents the hierarchical query view for the kidney. All records expressed in the kidney and their related data are linked with yellow lines between levels.

Figure 7. Hierarchical view of gene expression in the kidney. Yellow lines connect all data points related to the kidney.

Figure 6. Details for item hsa:124 – UD.GG.Contig26702.


Figure 8. 3D view of pathway, human and chicken chromosome maps highlighting the position mapping of genes expressed in the kidney.

One of the unique features of Starlight is its MetaImage tool, with which image files can be processed and graphically linked to correlated information within a dataset. Here, the Starlight MetaImage tool is used to process pathway and chromosome diagrams so that these diagrams can be visually linked to genetic information, supporting genomic visualization. Figure 8 presents a 3-D view of the data with gene, pathway, and chromosome position mapping of genes expressed in the kidney. Pathway diagrams and chromosome images are embedded into the same view space, allowing simultaneous exploration of these data. In Figure 8, image 1 (left) is the chicken chromosome map, image 2 (center) is the human chromosome map, and image 3 (right) is the pathway map of Glycolysis/Gluconeogenesis of Homo sapiens. The yellow lines highlight where the chicken homologs are matched with the human genes.

3. Results and discussion

BioMAS is a flexible and general multiagent system, which can work for different organisms of interest. The Gallus Knowledge Base and the Fungi Knowledge Base (http://udgenome.ags.udel.edu) have been generated using BioMAS. GallusKB holds a total of 30,214 chicken contigs, and chicken gene products have been predicted to be involved in a total of 140 pathways based on annotations in the KEGG database. 5851 human proteins in the KEGG database were used to search GallusKB, and 5437 (92%) Gallus gene products were found to have associated pathways. Through a Java parser, data in the databases can be formatted into XML and entered into Starlight. Since images can be loaded into the data view space of Starlight, these images can be very useful for a better understanding of the data. As shown in Figure 8, traditional static pathway diagrams and chromosome maps can be linked to data sets as well. Data can be presented to the user in a hierarchical view, allowing the user to explore the view space through the associated relationships while navigating the data sets. The examples used in Section 2.2 show preliminary results of applying Starlight for a genomics evaluation.

4. Conclusion and future work

In this paper, we have presented a multiagent approach to retrieve, integrate and analyze gene expression information, together with a new approach to visualize gene expression using the Starlight information visualization system. With the wealth of genetic information publicly available on the web, it is essential to develop increasingly powerful tools to retrieve target information from different bioinformatics resources and to visualize this information efficiently. Our approach is to apply a multiagent system (BioMAS) to retrieve data from separate resources and subsequently utilize the Starlight system to visualize correlations between the datasets.

The pathway agent of BioMAS can automatically update the human and chicken data in the pathway database whenever the data in the external resources change. Currently, the view space in Starlight has to be updated manually. However, this could be improved in future work by using a trigger to update the view space automatically whenever the external source data change. The MetaImage tool provided within Starlight can coordinate data with relevant images. This feature is not yet automated either, which can lead to significant lead time when processing large numbers of similar images; therefore, it is another focus of future work. The present data format is flat XML, and some work should be invested in using a database as the data format of choice. The preliminary results obtained to date are encouraging in outlining the potential for new visualization techniques to interactively explore genomic and pathway databases.

2005 IEEE ICDM Workshop on MADW & MADM 6

5. Acknowledgements

This publication was partially supported by awards from the National Science Foundation (NSF 0092336), the US Department of Agriculture (99-35205-8228), and the National Center for Research Resource at the National Institutes of Health (2 P20 RR016472-04) under the INBRE program.

6. References

[1] Human Genome Project, http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
[2] Jackson, J.H., Terminologies for Gene & Protein Similarity, Technical Reports & Reviews No. TR 99-01, Michigan State University, http://www.msu.edu/~jhjacksn/Reports/similarity.htm
[3] KEGG: Kyoto Encyclopedia of Genes and Genomes, Kanehisa Laboratory, http://www.genome.ad.jp/kegg/
[4] GenBank, NIH genetic sequence database, http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
[5] Swiss-Prot Protein knowledgebase; TrEMBL computer-annotated supplement to Swiss-Prot, http://us.expasy.org/sprot
[6] Ensembl Genome Browser, Sanger Institute, The Wellcome Trust, http://www.ensembl.org/
[7] Benson, D.A., et al., GenBank. Nucleic Acids Res., 28:15-18, 2000. http://www.ncbi.nlm.nih.gov
[8] Starlight Information Visualization System, Pacific Northwest National Laboratory, http://starlight.pnl.gov/
[9] Kritzstein, B., Starlight. Military Geospatial Technology Online Archives, Vol. 1, Issue 1, http://www.military-geospatial-technology.com/article.cfm?DocID=339
[10] Decker, K., Khan, S., Schmidt, C., Situ, G., Makkena, R., Michaud, D., BioMAS: a multi-agent system for genomic annotation. International Journal of Cooperative Information Systems, 11(3-4):265-292, 2002.
[11] Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J., The TSIMMIS project: integration of heterogeneous information sources. In Proceedings of the Tenth Anniversary Meeting of the Information Processing Society of Japan, December 1994.
[12] Nodine, M., Unruh, A., Facilitating open communication in agent systems: the InfoSleuth infrastructure. In M. Singh, A. Rao, and M. Wooldridge, editors, Intelligent Agents IV, pages 281-295, Springer-Verlag, 1998.
[13] Bryson, K., Luck, M., Joy, M., Jones, D.T., Applying agents to bioinformatics in GeneWeaver. In Proceedings of the Fourth International Workshop on Collaborative Information Agents, 2000.
[14] ExPASy Proteomics Server, Swiss Institute of Bioinformatics, http://us.expasy.org/
[15] Encyclopedia of Escherichia coli K12 Genes and Metabolism, http://ecocyc.org/
[16] Zimányi, E., Skhiri dit Gabouje, S., Semantic visualization of biochemical databases. In Semantics for GRID Databases: Proc. of the Int. Conf. on Semantics for a Networked World, ICSNW 2004, Paris, France, June 2004.
[17] Hu, Z., Mellor, J., Wu, J., DeLisi, C., VisANT: an online visualization and analysis tool for biological interaction data. BMC Bioinformatics, 5(1):17, 2004.
[18] Dickerson, J.A., Yang, Y., Blom, K., Using virtual reality to understand complex metabolic networks. Atlantic Symposium on Computational Biology and Genomic Information Systems & Technology, September, 950-953.
[19] Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., Kahn, D., ProDom: automated clustering of homologous domains. Brief Bioinform, 3(3):246-251, 2002.
[20] PSORT: subcellular localization prediction, Brinkman Laboratory, Simon Fraser University, http://www.psort.org/
[21] Ashburner, M., Lewis, S., On ontologies for biologists: the Gene Ontology--untangling the web. Novartis Found Symp 247, 66-80; discussion 80-3, 84-90, 244-52, 2002.
[22] Medical Subject Headings, http://www.nlm.nih.gov/mesh/meshhome.html
[23] Decker, K., Khan, S., Schmidt, C., Michaud, D., Extending a multi-agent system for genomic annotation. Proceedings of the Fifth International Workshop on Cooperative Information Agents, Modena, September 2001. LNAI 2182, Springer-Verlag, 2001.
[24] Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., Basic local alignment search tool. J Mol Biol, 215(3):403-410, 1990.
[25] Contiguous Chromosomal Chicken Sequences, Sequencing Center, Washington University, St. Louis, http://genome.wustl.edu/projects/chicken/index.php?unmasked=1


Concepts, Challenges, and Prospects on Multiagent Data Warehousing and Multiagent Data Mining

Wen-Ran Zhang

Dept. of Computer Science, Georgia Southern University Statesboro, Georgia 30460-7997, [email protected]

Phone: (912)486-7198, Fax: (912)486-7672

Abstract

A hierarchical architecture has been a dominating model for brain and behavior research for many years. Unfortunately, a hierarchical structure alone is too simplistic to be realistic for organizing many billions of neurons and genetic or biological agents into an autonomous system for high-level cognition. MADWH/MADM presents a multidimensional agent-oriented approach for brain modeling and decision making based on the hypothesis that a brain system consists of a society of semiautonomous neural agents and full autonomy is the result of coordination of semiautonomous functionalities. The agent-oriented approach leads to the following concepts, challenges, and prospects: (1) agent laws, agentization, agent discovery, law discovery, self-organization, and reorganization; (2) mining agent association rules in 1st-order logic; (3) modeling full autonomy as the result of coordination of semiautonomous agents; (4) modeling evolving processes like growing and aging; and (5) modeling healthy states as well as unhealthy states of biological systems. The short-term potential of MADWH/MADM is in its commercial value in multidimensional multiagent OLAP and OLAM in business, engineering, and biomedical applications. As a platform for scientific discoveries, its long-term impact can be far-reaching, especially in knowledge discovery about bio-agents.

Keywords: Multidimensional Agent-Oriented Brain Modeling; MADWH and MADM; Intermediate Agent Law; Agentization; Agent Association; Concepts-Challenges-Prospects

1. Introduction

Some abbreviations are listed in the following:

MADWH – MultiAgent Data WareHouse or Multiagent Data Warehousing [13,14,15]
MADM – Multiagent Data Mining [13,14,15]
VA – Virtual Agent [16]
VC – Virtual Community [16]
CI – Computational Intelligence [2]
AI – Artificial Intelligence
CCI – Coordinated CI [10,11,12]
DAI – Distributed AI [3,5]
MAS – MultiAgent System [3,6]
MAC – MultiAgent Cerebrum and/or Cerebellum [10,11,12]

The term MADWH/MADM as a total package was first proposed in [13,14] and refined in [15]. It was originally used for brain modeling and neurofuzzy control based on the work in [10,11,12]. Here the concept of a MADWH is generalized to a data/knowledge system that allows the warehousing of "agentwares," including agent specification, characterization, architecture, knowledge, associations, and organizations, or even the agent itself as well as the data or memory associated with it. An agent here is a data "miner" that is an autonomous, semiautonomous, computational, virtual, or bio agent.
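The notion of warehousing "agentwares" can be sketched as a data structure. The classes below are a hypothetical illustration (all class and field names are assumptions, not from the paper): the warehouse stores an agent's specification, knowledge, and associations alongside its data, rather than data alone.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical "agentware" record: a MADWH stores the agent's
# specification, knowledge (e.g. trained weights), associations to
# similar agents, and the data/memory associated with it.
@dataclass
class AgentWare:
    agent_id: str
    specification: dict                               # e.g. take-off configuration
    knowledge: Any = None                             # e.g. neural weight matrices
    associations: list = field(default_factory=list)  # links to similar agents
    memory: list = field(default_factory=list)        # recorded actions/data

# A tiny warehouse is then just an indexed collection of agentwares.
class MiniMADWH:
    def __init__(self):
        self._store = {}

    def register(self, aw: AgentWare):
        self._store[aw.agent_id] = aw

    def lookup(self, agent_id: str) -> AgentWare:
        return self._store[agent_id]

wh = MiniMADWH()
wh.register(AgentWare("A", {"theta1": 170, "theta2": 20, "theta3": 105}))
print(wh.lookup("A").specification["theta1"])  # -> 170
```

The point of the sketch is the contrast drawn in the text: the warehoused unit is an agent with its own knowledge and associations, not a subject-oriented fact table.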

While a traditional data warehouse is for business decision support, a MADWH is for agent modeling and knowledge discovery as well as business and engineering decision support. Although both are integrated, time-variant, and nonvolatile, a traditional data warehouse is subject-oriented and data-based; a MADWH is agent-oriented and agent-based.

MADWH provides a centralization for reinforced MADM, and MADM provides a distribution of learning activities that can further develop a MADWH. The two together form a YinYang pair for equilibrium or harmony in learning and decision. The agents or miners involved in MADWH/MADM can be heterogeneous or homogeneous: heterogeneous agents perform different functionalities; homogeneous agents perform the same functionalities.

Biological systems such as brains have enormous capabilities in information processing and knowledge discovery, including information storage, retrieval, sensor fusion, visualization, cognition, and recognition with spatio-temporal patterns. One challenging issue facing machine learning and knowledge discovery today is understanding how the enormous amount of radio, audio, bio-, and spatio-temporal data is processed, how knowledge networks are organized in the brain, and how a brain system can act as a coordinator of multiple miners for data mining and knowledge discovery.

For many decades, a hierarchical architecture has been the dominating model in brain-related research. Unfortunately, a hierarchical structure alone is too simplistic to be realistic for organizing many billions of neurons into an autonomous system for high-level cognition. Multiagent data warehousing (MADWH) and multiagent data mining (MADM) provide an alternative approach to brain modeling [10-15]. The new approach is based on a multidimensional agent orientation where agent similarity, agent cuboids, agent community, orthogonality, and reorganization are some basic concepts.

In MADWH/MADM the brain is considered a society of semiautonomous neural or genetic agents where full autonomy is the result of coordination of semiautonomous functionalities and learning is accomplished with multidimensional and multiagent data mining. It is shown in [10-15] that coordinated knowledge discovery is possible in an evolving dynamic environment with a large number of autonomous or semiautonomous neural agents as “actors” and agent actions as “transactions”.

Different from the Apriori algorithm [1], where frequency is used as an a priori threshold for mining item associations, MADWH/MADM uses agent similarity as an a priori threshold for discovering agent associations in first-order logic, which was once considered impossible in traditional data mining. Different from multirelational data mining (MRDM) [7,8], MADM does not assume a static data source or data stream. Instead, relevant data is like a mineral deposit that is to be located through coordinated multiagent exploration or "data outcropping" from an uncertain and dynamic environment before knowledge discovery. A MADWH (or a multiagent data mart) provides a brain model for the coordination of data outcropping and mining [15].
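The contrast with Apriori can be illustrated with a minimal sketch: instead of pruning candidate itemsets by a frequency threshold, candidate agent pairs are pruned by a similarity threshold before their actions are tested for association. The similarity measure below (at most one differing configuration parameter) is an illustrative assumption, not the paper's definition.

```python
# Sketch (not the authors' algorithm): similarity, rather than frequency,
# acts as the a priori pruning threshold for candidate agent associations.
def similar(cfg_a, cfg_b, max_diffs=1):
    """Agents count as 'similar' if their take-off configurations
    differ in at most max_diffs parameters (illustrative measure)."""
    return sum(1 for x, y in zip(cfg_a, cfg_b) if x != y) <= max_diffs

configs = {"A": (170, 20, 105), "B": (180, 20, 105), "C": (170, 30, 93)}

# Candidate generation: keep only similar pairs for further testing.
candidates = [(p, q) for p in configs for q in configs
              if p < q and similar(configs[p], configs[q])]
print(candidates)  # -> [('A', 'B')]: A and B differ only in theta1
```

Only the surviving pairs would then be tested by actual actions, which is where the support and confidence measures discussed later come in.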

Multidimensional agent orientation in MADWH and MADM provides a joint platform for many areas of research and development. From one point of view, MADWH/MADM promotes CCI and DAI for scientific, engineering, and business applications, including brain and neuroscience research itself, where the two long-term former adversaries of CI and AI can join forces with other areas. From a web-based software engineering perspective, self-organization and reorganization bring up a major challenge in the design, implementation, reuse, and integration of MADWH/MADM systems for personalization, user modeling, P2P, autonomy, and semiautonomy.

This work presents some concepts, identifies a number of challenges, and provides some prospects on MADWH/MADM with discussions on applicability, feasibility, and architectural design issues for different applications. Section 2 introduces some basic concepts in MADWH and MADM as a package. Section 3 discusses CCI vs. DAI. Section 4 reviews the example in [15] for further discussion. Section 5 presents a comparison between MADM and MRDM. Section 6 introduces an intermediate agent law for agentization and agent discovery. Section 7 discusses feasibility and applicability of orthogonal agent association. Section 8 identifies a number of challenges ahead. Section 9 draws a few conclusions.

2. Basic Concepts

MADWH/MADM focuses on the interplay of the two. With a MADWH, MADM algorithms can be developed in an evolving dynamic environment with autonomous or semiautonomous agents, especially CI agents. Instead of mining frequent itemsets from customer transactions or frequent patterns from multiple relations, MADM discovers new agents and mines agent associations in first-order logic for coordination based on agent similarity. The concept of agent similarity leads to the notions of agent cuboids and orthogonal MADWH and MADM.

The novelty of a MADWH lies in its ability to systematically combine neurofuzzy systems, multiagent systems, database systems, machine learning, data mining, information theory, neuroscience, decision, cognition, and control into a modern multidimensional information system architecture that is ideal for brain modeling of different animal species with manageable complexity. Although examples in robot control are used to illustrate the basic ideas as in [13,14,15], the new approach is generally suitable for data mining tasks where knowledge can be discovered collectively by a set of similar semiautonomous or autonomous agents from geographically, geometrically, or temporally distributed high-dimensional data environments.

3. CCI vs. DAI

While multiagent systems (MAS) originated from distributed AI (DAI) research and MADM can be considered distributed data mining, the term "multiagent data warehousing (MADWH)" was coined for brain modeling and neurofuzzy control [13,14,15]. It is a continuing research effort in coordinated computational intelligence (CCI) [10,11,12].

AI and CI research communities are well-known long-term former adversaries which are to be brought together with the MADWH/MADM platform. While traditional AI systems depend on symbolic reasoning techniques, numerical AI systems (fuzzy, neural, and/or genetic systems) mainly employ numerical computation for learning purposes. Due to its computational characteristics, numerical AI has been renamed computational intelligence (CI) [2]. Since CI components are fine-grained and numerical, their coordination is excluded from distributed artificial intelligence (DAI) research [3]. On the other hand, since CI components are inherently distributed and computational, decomposition and coordination of CI systems have so far been mostly buried in computation, with a few exceptions [10-15].

As an AI subfield, DAI research focuses on coordination and cooperation methodologies among coarse-grained multiagent systems (MASs) [3,5,6]. A MAS is concerned with the behaviors of a collection of autonomous agents and how they can coordinate their knowledge, goals, skills, and plans jointly to take actions or to solve problems collectively. Agents in a MAS may be working toward a single global goal or toward separate but related and sometimes conflicting individual goals. So agents must share knowledge about problems and developing solutions, must resolve their conflicts and reach compromised or optimal global solutions, and must reason about the processes of coordination among the agents [3].

An ideal autonomous DAI agent is a real world entity that has identity, knowledge, states, behaviors, and learning abilities. While being coordinated through communications, DAI agents form a MAS. A MAS is heterogeneous if the agents in the MAS are of different types; it is homogeneous if all agents are of the same type. A virtual agent (VA) [16] can be defined as a characterization or an image of an autonomous agent. A VA can also be a neural agent if the neural agent is a reflection of another agent in a brain system. Such VAs or neural agents can form a virtual community (VC) [16] that can be modeled as a MADWH.

The notion of "agent" should be central in CCI as well as in DAI. While decision makers, autonomous robots, and networked intelligent information systems are typical DAI agents, it would be very hard to imagine that the biological or artificial neural system of an autonomous agent could function well without a school of intermediate cerebral/cerebellar agents between itself and its memory cells, fuzzy rules, and billions of neurons. Evidently, we have to answer the following questions:

(1) What is an agent in CCI?

(2) What are the differences and similarities of CCI and DAI?

(3) How are CCI agents identified and coordinated?

(4) What can CCI offer to autonomous machine learning and control? and

(5) What can CCI offer to legged locomotion?

A cerebral/cerebellar agent in CCI [12] is referred to as a semiautonomous, cognitively identifiable neuro/fuzzy/genetic subsystem that (i) possesses partial states, knowledge, behaviors, and learning and decision abilities of an autonomous agent; (ii) can communicate with other agents and form, together with the others, a multiagent cerebrum/cerebellum (MAC) model of an autonomous agent; and (iii) does not lose its learning and decision abilities because of partial damage to others. A MAC model is homogeneous if all its agents perform the same type of function; otherwise, it is heterogeneous. A MAC model is a centralized model if there is a central coordinator; it is a distributed model if there is no central coordinator and coordination is achieved through communication protocols; it is a federated model if both centralized control and distributed protocols are used.

A cerebral/cerebellar agent is a fuzzy agent if the agent’s knowledge representation is based on fuzzy sets, learning and control are accomplished via fuzzy rules and/or fuzzy pattern recognition. A cerebral/cerebellar agent is an associative memory-based agent if the agent’s knowledge is represented in table or matrix forms where learning and control are accomplished via table-driven adaptive schemes. A cerebral/cerebellar agent is a neural agent if learning and control are accomplished via neural nets with neurons. An agent is a neurofuzzy agent if both neural and fuzzy techniques are combined.

The notion of CCI follows the hypothesis that the brain system of an autonomous agent consists of a school of cerebral/cerebellar agents which reflect the conceptual and/or physical world including the body states of the agent itself [12,15]. CCI is mainly concerned with: (1) agent-oriented decomposition of a brain system; (2) the coordination of neurofuzzy agents; (3) the formation of a MAC model; (4) the adaptive, incremental, exploratory, and explosive learning behaviors of a MAC model; and (5) self-organization and reorganization of CCI agents.

With the above definitions, low-level fuzzy rules, associative memory cells, or neurons cannot be considered CCI agents because they, individually, do not show any cognitively identifiable agent behaviors or learning abilities. An associative memory module, on the other hand, can be considered a CCI agent if it meets the conditions for a CCI agent. It is interesting to consider the left and right cerebellum subsystems. Apparently, they are not fully autonomous, but they fit well into the category of homogeneous CCI agents. It is not too unusual to see someone who is half paralyzed due to left- or right-side neural damage but who may still be able to move on one leg with some support. This is a typical example of homogeneous semiautonomy. On the other hand, a MAC system with vision and hearing agents and/or arm and leg control agents is clearly heterogeneous.

As a CI subfield, CCI should share the following common characteristics with DAI:

(1) both are defined in an agent-oriented and distributed world;

(2) both can use cooperation as well as competition strategies;

(3) both need conflict resolution;

(4) both need communication; and

(5) both use coordination as a key.

A dividing line between CCI and DAI can be drawn with the following essential distinctions:

(1) DAI relies on symbolic representation and reasoning; CCI mainly relies on numerical representations and neuro/fuzzy/genetic learning schemes.

(2) A multiagent cerebrum/cerebellum (MAC) model in CCI is defined in a fine-grained semiautonomous neuro/fuzzy/genetic agent world; a multiagent system (MAS) in DAI [3,5] is defined in a coarse-grained autonomous agent world.

(3) DAI agents are mostly loosely-coupled systems that use intercommunications; CCI agents are tightly-coupled sub-systems that use intracommunications or brainstorming.

(4) A MAS consists of a collection of autonomous agents which can take actions individually with or without coordination; a MAC model consists of a collection of semiautonomous cerebral/cerebellar agents which can make decisions individually or collectively but normally do not take actions without coordination.

(5) DAI aims at enhancing effectiveness and efficiency of combined social and physical organizations; CCI searches for an agent-oriented brain architecture for an autonomous agent to emulate human learning and control.

(6) DAI agents adapt to certain social protocols for their coordination; CCI agents adapt to social protocols and common-sense versions of natural motion laws (referred to as cerebellar laws in this paper) for their coordination. In particular, cerebellar agents rely heavily on cerebellar laws for coordination due to the nature of their motion control tasks.

The interplay of CCI and DAI can be essential in solving complex distributed problems. A MAC system can be an agent of a MAS. On the other hand, the semiautonomous cerebral/cerebellar agents of a MAC system can reflect the states of the autonomous agents of a MAS. Therefore, CCI can be used in the coordination of DAI agents and vice versa.

4. A MADWH and MADM Approach for Brain Modeling and NeuroFuzzy Control

4.1 Agent Cuboids vs. Data Cuboids

In [10-15], it is shown that an agent can be an autonomous or semiautonomous neurofuzzy agent for robot control. Two simulated unipeds are sketched in Fig. 1. The goal is to enable a simulated N-link uniped (whose motion is governed by a set of 2nd-order differential equations with an infinite number of inverse solutions) to learn gymnastic jumps. Each jump can be characterized by a <V, M> pair, where V is a control vector and M is a measure vector, as defined in Fig. 1 for a 3-link and a 4-link uniped.

Note that the 4-link (foot, lower leg, upper leg, and body) V vector has 10 dimensions and the M vector has 7 dimensions. The angles θ1-θ4 in V determine the take-off configuration of the robot; T1-T3 are torques applied to the three joints for taking off; T4-T6 are torques applied to the joints in flight to configure the robot for proper landing. The torques can be replaced with desired joint angles θd1, θd2, θd3 for landings. H, D, and A define the jump height, distance, and landing angle, and the four angles θL1, θL2, θL3, and θL4 define the landing configuration. The landing configuration can be equivalently determined by (A, LH, θLx, θLy), where A is the landing angle, LH is the landing mass center height, and θLx, θLy are any two different link angles. A 3-link uniped has two joints and needs two take-off torques and two in-flight torques for a jump.
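The <V, M> pair for the 4-link uniped can be encoded directly from the dimensions above. The sketch below is an illustrative grouping (class and field names are assumptions), and the numeric values are placeholders rather than data from the paper.

```python
from dataclasses import dataclass

# Illustrative encoding of a 4-link uniped action <V, M>:
# V has 10 dimensions (4 take-off angles + 3 take-off torques +
# 3 in-flight torques) and M has 7 (H, D, Am, 4 landing angles).
@dataclass
class Jump4Link:
    takeoff_angles: tuple    # (theta1, theta2, theta3, theta4)
    takeoff_torques: tuple   # (T1, T2, T3)
    flight_torques: tuple    # (T4, T5, T6) or desired landing angles
    H: float = 0.0           # jump height
    D: float = 0.0           # jump distance
    Am: float = 0.0          # landing angle
    landing_angles: tuple = ()  # (thetaL1 .. thetaL4)

    def V(self):
        """The 10-dimensional control vector."""
        return self.takeoff_angles + self.takeoff_torques + self.flight_torques

# Placeholder values, not taken from the paper's experiments.
j = Jump4Link((170, 20, 105, 90), (-219, 20, 110), (-20, 15, 5))
print(len(j.V()))  # -> 10
```

A 3-link agent would use the same shape with three angles and 2+2 torques, matching the 7-dimensional V given in the text.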

In this case, a brain system is hypothesized as a society of semiautonomous neural agents that can be cognitively identified by the take-off configurations of a jumping uniped [10-15] for different long, short, backward, and forward gymnastic jumps. Based on agent identification, agent similarity, and function similarity, we have the concept of agent cuboids. While data cuboids organize relevant data sets into hypercube structures for business decision support, agent cuboids organize relevant or similar (cooperative or competitive) neural agents into hypercube brain structures for coordinated machine learning and control.

For any orthogonal neighborhood, we have the application-specific associations in first-order predicate logic as in Fig. 2. Fig. 3 shows the 4-link uniped configurations of 16 neural agents that can be organized in a 4-D agent cube as in Fig. 4.


Fig. 5 shows two link weight matrices of a trained 3-layer BP neural controller with an error rate of 0.000009. It is assumed that an autonomous agent coordinates many semiautonomous neural agents in the MADWH. Whenever an agent is called from the warehouse, its link weights are assigned to the neural controller to generate a V vector for a desired jump measure vector M.

Five jumps by five corner agents of a 3-link uniped are listed in Fig. 6. Evidently, agents A and B are similar based on their jumps because they differ in one corner parameter (θ1) and they are able to make almost the same jump with different joint torques. Agents C and D are similar also, but C can make longer jumps with the same height. Agents D and E are also similar; agent E can make an even longer jump. (Note: angles are in degrees; height and distance are in meters.) Thus, every pair of neighbor agents can be considered similar. Here only one pair of actions is selected. If it is selected from 100 actions, the support is 0.01. Since only one pair of actions is tested successfully, the confidence is 1.0.
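The support and confidence arithmetic in this example is simple enough to state as code; the function below is just a restatement of the counts above.

```python
# Support = selected action pairs / total recorded actions;
# confidence = successfully tested pairs / selected pairs.
def support_confidence(pairs_selected, total_actions, pairs_successful):
    support = pairs_selected / total_actions
    confidence = pairs_successful / pairs_selected
    return support, confidence

# One pair selected out of 100 actions, and that one pair succeeds.
s, c = support_confidence(1, 100, 1)
print(s, c)  # -> 0.01 1.0
```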

Every pair of neighbor agents in the agent cuboids of Fig. 4 is tested as similar agents with high support and confidence measures. It can then be concluded that both agent cuboids are orthogonal. From Fig. 6, we can see that the similar action measures differ only in the distance dimension. The application-specific 1st-order association rules in this case can be determined as in Fig. 7. Interestingly, the two association rules are evidently dynamic motion laws in predicate logic form. Such laws can be used as meta-knowledge for further coordinated data mining or knowledge discovery.

Similarity leads to agent interpolation and extrapolation in a global mining process. A MADWH provides an efficient and adequate platform for modeling a brain system in performing coordinated adventures. At the neural network level, interpolation and extrapolation result in weight matrices for a new neural net assuming the same neural architecture. The weight matrices can be used as initial weights for training interpolated or extrapolated neural controllers. This can reduce training time dramatically compared with using random initial link weights. Given two similar BP neural agents A and B with neural weight matrices WA and WB, respectively, based on the dynamic motion laws in Fig. 7 we have

WI ≈ (WA+ WB)/2 and WE ≈ 2WA - WB;

where WI is a weight matrix of an interpolated neural agent, and WE is a weight matrix of an extrapolated neural agent assuming the same neural architecture.
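Under the assumption of identical network architectures, the interpolation and extrapolation relations translate directly into matrix arithmetic. The sketch below uses random stand-in matrices; in the paper's setting WA and WB would come from trained BP controllers of similar agents.

```python
import numpy as np

# Agent interpolation/extrapolation on neural weight matrices,
# following WI ≈ (WA + WB)/2 and WE ≈ 2·WA − WB from the text.
# The matrices here are random stand-ins with an assumed 6x7 shape.
rng = np.random.default_rng(0)
WA = rng.standard_normal((6, 7))   # weights of similar agent A
WB = rng.standard_normal((6, 7))   # weights of similar agent B, same architecture

WI = (WA + WB) / 2                 # initial weights for an interpolated agent
WE = 2 * WA - WB                   # initial weights for an extrapolated agent

# WI lies midway between WA and WB; WE is WA reflected away from WB,
# so the step from WB to WA equals the step from WA to WE.
assert np.allclose(WA - WI, WI - WB)
assert np.allclose(WE - WA, WA - WB)
```

Using WI or WE as initial weights (rather than random initialization) is what the text credits with the dramatic reduction in training time for newly discovered agents.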

Interpolation and extrapolation constitute learning by agent discovery. The learning speed of agent discovery is geometric for uniped locomotion control [12]. Therefore, the orthogonal neural agent representation of dynamic motion laws provides effective inverse dynamics for the 2nd-order differential (motion) equations that govern the motion of an autonomous agent. Such a brain structure may well explain the phenomenon that an animal can learn and apply dynamic motion laws without understanding them.

With the similarity law, agent extrapolation allocates new agents in the plausible directions. The implausible directions are marked with DeadEnd. A DeadEnd is also an important discovery: it helps redirect the exploration toward the plausible directions. It emulates the process of outcropping in mineral deposit exploration by a team of miners. Fig. 8 shows the exploration in the long-jump direction with agents A, B, C, and D as in Fig. 6.
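The DeadEnd-marking exploration can be sketched as follows; this is an illustrative reconstruction, not the authors' algorithm, and the failure test (a hypothetical lower bound on θ3) is an invented placeholder.

```python
# Sketch of "data outcropping": extrapolate candidate agents in each
# direction from a frontier configuration; directions whose candidates
# fail the jump test are marked DeadEnd, redirecting exploration.
def explore(frontier_cfg, directions, test_jump):
    discovered, dead_ends = [], []
    for d in directions:
        candidate = tuple(x + dx for x, dx in zip(frontier_cfg, d))
        if test_jump(candidate):
            discovered.append(candidate)   # a plausible new agent
        else:
            dead_ends.append(d)            # a DeadEnd is itself a discovery
    return discovered, dead_ends

# Hypothetical test: assume configurations with theta3 below 90 fail.
ok = lambda cfg: cfg[2] >= 90
found, dead = explore((170, 20, 93), [(0, 10, 0), (0, 0, -10)], ok)
print(found, dead)  # one plausible direction, one DeadEnd
```

The returned DeadEnd directions play the role described in the text: they prune the search space so the team of "miners" concentrates on plausible directions.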

It should be remarked that traditional data warehouse query languages and utilities can be extended for a MADWH. Agent-oriented drill-down, roll-up, slice, dice, and pivot operations on a MADWH can support brain analysis and cognition at different levels of concentration. Some spatio-temporal patterns in multiagent brain modeling are sketched in Fig. 9. The snapshot of a controlled jump is shown in Fig. 10. A model for mental concentration in gymnastics is sketched in Fig. 11, where the smaller the kernel space, the more concentrated and more precise the controlled jump.
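Agent-oriented slice and dice can be illustrated on a tiny agent cube keyed by take-off configurations; the corner values echo Fig. 8, and the two helper functions are illustrative assumptions rather than an actual MADWH query language.

```python
# A toy agent cube: agents keyed by (theta1, theta2, theta3) corners.
cube = {
    (170, 20, 105): "A", (180, 20, 105): "B",
    (170, 30, 105): "C", (170, 30, 93): "D", (170, 20, 93): "E",
}

def slice_(cube, dim, value):
    """Slice: keep agents whose dim-th configuration parameter equals value."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice(cube, ranges):
    """Dice: keep agents whose parameters fall inside per-dimension ranges."""
    return {k: v for k, v in cube.items()
            if all(lo <= k[i] <= hi for i, (lo, hi) in enumerate(ranges))}

print(sorted(slice_(cube, 0, 180).values()))  # -> ['B']
print(sorted(dice(cube, [(170, 170), (20, 30), (90, 105)]).values()))
# -> ['A', 'C', 'D', 'E'] (every agent with theta1 = 170)
```

Drill-down and roll-up would analogously move between coarser and finer configuration granularities, paralleling the data-cuboid operations of a traditional warehouse.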

Fig. 1. Control and measure parameters of an action

Rule 1:
∀A1,A2 {action(A1, measurei, measurej1) ∧ action(A2, measurei, measurej2) ∧ neighbor(A1, A2) ⇒ ∃A3, action(A3, measurei, (measurej1 + measurej2)/2) ∧ neighbor(A1, A3, A2)};

Rule 2:
∀A1,A2,A3 {action(A1, measurei, measurej1) ∧ action(A2, measurei, measurej2) ∧ neighbor(A1, A2, A3) ⇒ action(A3, measurei, 2×measurej2 − measurej1)}; OR
∀A1,A2,A3 {action(A1, measurei, measurej1) ∧ action(A2, measurei, measurej2) ∧ neighbor(A3, A1, A2) ⇒ action(A3, measurei, 2×measurej1 − measurej2)};

Fig. 2. Application-specific association rules (adapted from [15])

3-link: V = (θ1, θ2, θ3, T1, T2, T3, T4) or V = (θ1, θ2, θ3, T1, T2, θd1, θd2); M = (H, D, Am, θL1, θL2, θL3) ≡ (H, D, A, LH, Am, θLx, θLy), x, y = 1, 2, or 3, x ≠ y.
4-link: V = (θ1, θ2, θ3, θ4, T1, T2, T3, T4, T5, T6) or V = (θ1, θ2, θ3, θ4, T1, T2, T3, θd1, θd2, θd3); M = (H, D, Am, θL1, θL2, θL3, θL4)


Fig. 3. 16 corner agents (adapted from [12])

Fig. 4. A 4-D base cuboid with 16 corner agents

⎥⎥⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢⎢⎢

3.492346 0.44012 1.559476 0.746009 0.118515 3.399694- 1.110398- 0.500489 0.178415 1.282669 0.268073- 0.549651- 2.248199- 0.968122-

2.533893- 0.778026- 2.318124- 1.475501- 1.486067- 1.602977- 0.949556- 0.622034 0.826496 0.843568 0.195054 0.709369 0.74155 0.740411

1.970769 2.000594 4.920876- 1.778801 0.977284 8.748863 4.225339 2.064914- 0.567861- 3.367871 1.805584- 0.803275- 5.789867- 3.380671-

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

1.199774- 0.510151- 2.685953- 1.359817- 1.101889- 2.739815 4.16698- 7.586217- 2.924549- 2.463824- 2.797362- 4.623281- 2.811486- 2.683949-

0.773187 5.800477 2.85431 0.129961- 0.539471- 6.934944- 3.565889 8.965652- 4.400636- 2.924098- 3.647653- 5.24406- 3.264428- 3.839715-

8.771423 4.19878 3.155711 3.542838 5.614085 3.03486 3.207681

n1: 6 n2: 6 n3: 5 Err: 0.000009 Fig. 5. Weights matrices of a 3-layer BP neural controller V: (θ1, θ2, θ3, T1, T2, T3, T4) M:(D, H, A, θL1, θL2, θL3)

       θ1   θ2  θ3   T1    T2  T3   T4   D     H    A     θL1   θL2   θL3
A      170  20  105  -219  20  110  -20  0.3   0.9  -4    0.29  0.3   164
B      180  20  105  -227  20  104  -20  0.3   0.9  -4    0.29  0.3   164
C      170  30  93   -291  20  83   -20  0.69  1.3  -3.9  0.29  0.45  164
D      170  30  105  -284  20  135  -20  0.25  1.3  -4    0.29  0.26  164
E      170  20  93   -309  20  118  -20  1.0   1.3  -4    0.29  0.46  164

Fig. 6. Similar actions by similar agents (adapted from [15])

∀A1,A2, {jump(A1, H=x, D=short) ∧ jump(A2, H=x, D=long) ∧ neighbor(A1, A2) ⇒ ∃A3, jump(A3, H=x, D=medium-long) ∧ neighbor(A1, A3, A2)};

∀A1,A2,A3, {jump(A1, H=x, D=short) ∧ jump(A2, H=x, D=medium-long) ∧ neighbor(A1, A2, A3) ⇒ jump(A3, H=x, D=long)}; OR

∀A1,A2,A3, {jump(A1, H=x, D=medium-long) ∧ jump(A2, H=x, D=long) ∧ neighbor(A3, A1, A2) ⇒ jump(A3, H=x, D=short)}.

Fig. 7. Dynamic motion laws as agent association rules

[Figure: five similar agents ordered by jump distance (shorter distance, longer distance, longest distance), with configurations A: (180, 20, 105), B: (170, 20, 105), C: (170, 30, 105), D: (170, 30, 93), E: (170, 20, 93); a note indicates that a slightly larger torque is needed for the same jump measure.]

Fig. 8. Coordinated data outcropping and data mining

Fig. 9 A sketch of spatio-temporal patterns in brain modeling for uniped control (adapted from [12])

Fig. 10 Snapshots of a simulated controlled uniped jump

Fig. 11 Neural agents for different levels of mental concentration in gymnastics



5. MADM vs. MRDM – A Comparison

It is interesting to compare multiagent data mining (MADM) with multirelational data mining (MRDM). In [10-15], each neurofuzzy agent for a 3-link uniped has 14 dimensions, and each agent for a 4-link uniped has 17 dimensions. There are an infinite number of inverses of the 2nd-order differential equations governing the robot motion, which lead to an infinite number of solutions. Using agent association, there could be many agent cuboids that form an agent community or society for different gymnastic jumps or for jumping control under different gravities or terrains. The cognitive complexity would be unmanageable with a usual data warehouse. For instance, it would be very difficult, if not impossible, to represent the data and knowledge for visualization and decision making in a table format without the multiagent approach, and it would certainly be impossible to mine dynamic motion laws in 1st-order predicate logic without agent associations.

Here agent association is classified as orthogonal or non-orthogonal. We examine orthogonal agent association and leave non-orthogonal agent association for future study. An orthogonal agent association rule takes the general form

∀A(gent)1,A(gent)2, {P(A1,A2) ⇒ ∃A(gent)3{Q(A1,A2,A3)}}

which reads "for all Agent1 and Agent2, IF the predicate P(A1,A2) is true THEN there exists some Agent3 SUCH THAT Q(A1,A2,A3) is true." Two agent association rules for agent interpolation and extrapolation in coordinated data outcropping and data mining are listed in Fig. 12.

∀A1,A2, {similar(A1,A2) ⇒ ∃A3 {A3 ≈ (A1 + A2)/2 ∧ similar(A3,A1) ∧ similar(A3,A2)}}. – Interpolation Rule

∀A1,A2, ∃A3, {similar(A1,A2) ∧ similar(A2,A3) ∧ (distance(A2,A3) = d) ∧ (distance(A1,A3) = 2d) ⇒ A3 ≈ 2·A2 − A1}. – Extrapolation Rule

Fig. 12 Two association rules for uniped control
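As a concrete illustration of the two rules, the sketch below reduces an agent to its numeric parameter vector. This is a hypothetical simplification of this sketch (the full agents in [10-15] also carry neural weights), but the interpolation and extrapolation formulas are exactly those of Fig. 12:

```python
def interpolate(a1, a2):
    """Interpolation rule: discover A3 with A3 ≈ (A1 + A2) / 2."""
    return [(x + y) / 2 for x, y in zip(a1, a2)]

def extrapolate(a1, a2):
    """Extrapolation rule: discover A3 with A3 ≈ 2·A2 − A1."""
    return [2 * y - x for x, y in zip(a1, a2)]

# Two similar agents, here identified by hypothetical (theta1, theta2, theta3) configurations
A = [180.0, 20.0, 105.0]
B = [170.0, 20.0, 105.0]

mid = interpolate(A, B)   # a virtual agent halfway between A and B
ext = extrapolate(A, B)   # a virtual agent beyond B at the same spacing
```

By the similarity assumption, mid and ext are only candidate new agents; their actual capability still has to be verified by testing, as coordinated data outcropping and data mining would do.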

A close examination of the orthogonal association rules reveals that agent similarity is a key for orthogonality. There are a number of similarities between MADM and MRDM including:

(1) MADM and MRDM can both be used for machine learning and engineering applications.

(2) Both are suitable for knowledge discovery from high-dimensional data environments.

(3) Both can be used for extracting association rules in zero- or first-order logic.

MADM can be distinguished from MRDM or graph-based mining [7,8] as follows.

(1) Agents are dynamic actors and/or controllers while relations and graphs are static data sets and structures.

(2) Agent association rules govern agent communities, while an item association is a rule about items and a relation association is a pattern regarding some relations; neither possesses the cognitive identity, dynamics, learning, decision making, and control ability of an autonomous or semiautonomous agent.

(3) Orthogonal agent association is based on agent similarity as a priori knowledge while item association is based on item frequency as a priori knowledge and multirelational or graph association may use a relational strength in the real interval [0,1].

(4) Orthogonal agent association may lead to the discovery of new neural, fuzzy, or genetic agents similar to existing agents, while item association and MRDM or graph data mining are not motivated by agent discovery.

(5) Agents can be coordinated for collaborative knowledge discovery and decision making, while items and relations cannot be coordinated.

(6) Orthogonal agent associations lead to an orthogonal MADWH that resembles a brain system while relational associations are not agent-oriented.

(7) Coordinated multiagent data mining can discover dynamic motion laws that cannot be observed in item and relational association rules.

(8) Agent association assumes a dynamic and distributed data environment while item and relational associations assume a static data source.

6. Intermediate Agent Law

Orthogonal agent association rule mining in 1st-order logic leads to the extension of the mean-value theorem in calculus to a commonsense intermediate agent law for MADWH/MADM.

Intermediate Agent Law. Given any pair of similar biological-system-inspired computational agents or agent communities A1 and A2 defined in a multidimensional space with a measurable distance 2d>0, a third similar agent or agent community A3 can be discovered or created such that A3 is similar to A1 and A2 with distance d to each of them.

In the above commonsense law, a biological-system-inspired computational agent could be any



computer-based system, intelligent or not, autonomous or semiautonomous, mobile or stationary, ground or airborne, neural or genetic, a CI or AI agent, or any other computational agent. Cruise controllers, robot controllers, artificial neurons and neural networks, gene expressions and genetic models, fuzzy controllers and systems, rough-set-based systems, and different memory components are apparently such agents.

The commonsense law provides a basis for the extension of digitization to agentization. The concept of agentization is popular in military operational research; it is adapted to MADWH/MADM in refs. [12,15]. Here the term "agentization" stands for "populating a multidimensional space with virtual or real agents or agent communities." With the intermediate agent law, an orthogonal MADWH can be defined as a virtual non-linear dynamic agentization, and MADM can be defined as a coordinated discovery process for new agents, agent associations, agent organizations, and agent laws.
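A minimal sketch of agentization in this sense, assuming (as a simplification of this sketch) that agents are plain numeric vectors: the intermediate agent law is applied recursively to populate the segment between two similar agents with virtual intermediate agents.

```python
def agentize(a1, a2, depth):
    """Recursively populate the space between two similar agents with
    virtual intermediate agents (repeated intermediate agent law)."""
    if depth == 0:
        return []
    # the intermediate agent, at distance d from each of the two endpoints
    mid = [(x + y) / 2 for x, y in zip(a1, a2)]
    return agentize(a1, mid, depth - 1) + [mid] + agentize(mid, a2, depth - 1)

# depth k yields 2**k - 1 virtual agents between the two endpoints
agents = agentize([0.0, 0.0], [8.0, 4.0], 2)
```

Each recursion step is one application of the law: given A1 and A2 at distance 2d, it creates A3 at distance d from both.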

7. Feasibility and Applicability of Orthogonal Agent Association

Feasibility. From the earlier discussions, two major necessary conditions for orthogonal agent association and an orthogonal MADWH can be derived.

(1) Agent association requires agent identification. Agent identification can be accomplished using the information-gain method, the Gini method, or parameter analysis. In almost all robot learning tasks, some physical configurations of the robots (including ground, underwater, and flying robots) are almost certain to be good cognitive identities of neural agents for neural learning and control.

(2) Agent similarity is a key for orthogonal multiagent data warehousing. To define similarity, the corner or key parameters must be identified to define configuration similarity, and agent capabilities must be tested to define function similarity.

Applicability. From the earlier discussions, the following applicability conditions can be derived for orthogonal agent association.

(1) The environment meets the two feasibility conditions;

(2) The learning/decision/control space is geometrically, geographically, conceptually, and/or temporally distributed;

(3) The learning/decision/control task is dynamic in nature;

(4) Many autonomous or semiautonomous agents are needed for the learning/decision/control;

(5) Collective/explorative learning/decision/control is needed; and

(6) Coordination is possible.

It is evident that orthogonal agent association may not be suitable for a static data environment if the data storage is incomplete. It is also evident that orthogonal agent association is a good fit for brain modeling and robot learning/control, because the billions of neurons in a brain system have to form a large community or society of semiautonomous neural agents for the exploration of different dynamic data environments.

8. Challenges

MADWH/MADM as an emerging research area is still in its infancy. Many tough challenges lie ahead. We enumerate some challenging issues as follows.

Bring CCI and DAI together. The first and foremost challenge is how to bring CCI and DAI together for the interplay of MADWH and MADM. It is expected that CCI and DAI will play a major role in MADWH/MADM. However, many questions are yet to be answered in this direction of research.

Warehousability, agent identification, agent-oriented decomposition, and agentization. The concepts of semiautonomy, full autonomy, agent cuboids, and agent society are for dealing with the complexity in brain modeling and agentization. It is suggested that all bio-like robots or devices can be decomposed into semiautonomous agents based on their physical configurations. However, some agents are better suited for a MADWH than others. In general, the architecture and the embedded knowledge of neural networks and fuzzy controllers [9] can be almost completely characterized and stored in a MADWH. Therefore, it is easier to adapt MADWH/MADM to scientific and engineering applications. On the other hand, a mobile software agent can be stored in a warehouse for dispatching or "agentization," but it might be difficult to identify its dimensions. Many research efforts are needed for agent identification, agent-oriented decomposition, and agentization in different application domains, especially for web applications.

Heterogeneity. In the robot control example, the agents are homogeneous because all are for gymnastic jumping. How to organize heterogeneous agents into a MADWH/MADM framework is a great challenge. A typical heterogeneous example in data mining is to



warehouse all agents for data cleaning/integration, selection/transformation, mining/discovery, pattern evaluation, and visualization such that different matching sequences can be selected and optimized for different data mining tasks. Another typical example is to combine radio, audio, and motor control agents in brain modeling for autonomous learning/control. MADWH/MADM is not meant to provide the final solutions for such tough challenges but to provide enabling technologies that lead to evolving better solutions.

Schema design complexity. Multidimensional agent orientation is evidently a complex concept. Agent orientation can be considered an extension of object orientation, and object-oriented database design techniques can be borrowed. However, agent orientation has to fit into multiple dimensions.

Query language design complexity. To the author’s best knowledge, no such query languages have been developed yet. It is expected that agent-orientation can be mounted to SQL-based data mining query languages [4] like DMQL.

Complexity in agent discovery and law discovery. In [10-15], agent discovery is illustrated with interpolation and extrapolation, and law discovery is illustrated with mining agent associations in first-order logic for the specific application of neurofuzzy control. Similar discovery has not been fully researched in many other application areas.

Complexity in agent-oriented self-organization and reorganization. Although self-organization has been a hot topic in neural network research, self-organization and reorganization have not been fully addressed at the autonomous and semiautonomous agent levels. It is shown in [12] that self-organization and reorganization are possible in multiagent brain modeling. It is a challenging and interesting task to address the self-organization and reorganization issues at different levels of agent granularity, for instance, at the macro-, micro-, neuron, genetic, and/or nano-levels.

Reinforced knowledge discovery through the interplay between MADWH and MADM. Although this seems to be a big challenge, it could be the most enjoyable step once the other difficulties have been resolved. A MADWH provides the centralization of agent-oriented data, knowledge, and brainstorming algorithms; MADM provides distributed mechanisms and methods for reinforced knowledge discovery. A MADWH can enhance MADM, and MADM can further develop and refine a MADWH [15]. The two can be considered a YinYang pair for equilibrium and harmony.

MADWH/MADM for neuroscience, bioinformatics, and biomedical research. Evidently, this is a forever challenging and forever promising area of research.

MADWH/MADM for wireless sensor networks and semantic web. Many research topics remain untouched in these areas.

MADWH/MADM for knowledge management. This opens a new avenue in management information systems research in addition to P2P and B2B business models.

Apply MADWH/MADM to different engineering, scientific, government, military, and business applications. These applications are evidently domain-specific with the common agent-oriented approach to brain modeling. Since multiagent brain modeling and research will never end, the application of MADWH/MADM does not seem to have a boundary.

MADWH/MADM and agent-oriented software engineering paradigm. MADWH/MADM adds new challenges for agent-oriented software engineering. Agent-oriented self-organization and reorganization for coordinated data mining and knowledge discovery is a major challenge. Integration and optimization with different agents and subtasks for data cleaning, integration, selection, transformation, mining/discovery, pattern evaluation and visualization is a typical example.

9. Conclusions

Some basic concepts have been introduced for MADWH and MADM. A comparison has been provided between a traditional data warehouse and a MADWH, and between MADM and MRDM. The roles of CCI and DAI in MADWH and MADM have been discussed. A commonsense intermediate agent law has been posited. A number of challenges have been identified. Despite the great challenges, it can be concluded that

(1) MADWH/MADM provides a joint platform for many different research areas including CCI and DAI;

(2) It enables agent discovery, agent law discovery, self-organization, and reorganization.

(3) It enables full autonomy as the result of coordination of semiautonomous functionalities.

(4) It enables the modeling of evolving processes like growing and aging.



(5) The short-term potential of MADWH/MADM lies in its commercial value in multidimensional agent-oriented OLAP and OLAM; and

(6) Its long-term impact is far-reaching, because its potential for supporting scientific discoveries as well as business decision support, especially discoveries about bio-agents and bio-inspired agents and laws at the macro-, micro-, and/or nano-levels, is forever promising and challenging.

References

[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. "Fast discovery of association rules." In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. Cambridge, MA: AAAI/MIT Press, 1996.

[2] J. C. Bezdek. "On the relationship between neural networks, pattern recognition, and intelligence." Int'l J. of Approximate Reasoning, Vol. 6, 1992, 85-107.

[3] A. H. Bond and L. Gasser. "An Analysis of Problems and Research in DAI." In Readings in Distributed Artificial Intelligence, eds. A. H. Bond and L. Gasser, Morgan Kaufmann, 1988, 3-35.

[4] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

[5] M. N. Huhns, editor. Distributed Artificial Intelligence. Pitman, London, 1987.

[6] M. N. Huhns and M. P. Singh. Readings in Agents. Morgan Kaufmann, 1997.

[7] MRDM'01: Workshop on Multi-Relational Data Mining. In conjunction with PKDD'01 and ECML'01, 2001. http://www.kiminkii.com/mrdm/.

[8] H. Toivonen and L. Dehaspe. "Discovery of Frequent Datalog Patterns." Data Mining and Knowledge Discovery, 3(1):7-16, 1999.

[9] H. Ying. Fuzzy Control and Modeling: Analytical Foundations and Applications. IEEE Press, 2000.

[10] W. Zhang. "MAC-J: A Self-Organizing Multiagent Cerebellar Model for Fuzzy-Neural Control of Uniped Robot Locomotion." Int'l J. on Intelligent Control and Systems, Vol. 1, No. 3, 1996, 339-354.

[11] W. Zhang. "Neurofuzzy Agents and Neurofuzzy Laws for Autonomous Machine Learning and Control." IEEE Int'l Conf. on Neural Networks, Houston, TX, June 1997, 1732-1737.

[12] W. Zhang. "Nesting, Safety, Layering, and Autonomy: A Reorganizable Multiagent Cerebellar Architecture for Intelligent Control – with Application in Legged Locomotion and Gymnastics." IEEE Trans. on Systems, Man, and Cybernetics, Part B, Vol. 28, No. 3, June 1998, 357-375.

[13] W. Zhang. "Modeling a Cerebrum/Cerebellum System as an Evolving Multiagent Data Warehouse." Proc. of the 6th Joint Conf. on Information Sciences (JCIS) – CIN, March 8-13, 2002, Duke University, NC, 541-544.

[14] W. Zhang. "A Multiagent Data Warehousing and Multiagent Data Mining Approach to Cerebrum/Cerebellum Modeling." Proc. of SPIE Int'l Conf. on Data Mining and Knowledge Discovery, April 2002, Orlando, FL, 261-271.

[15] W. Zhang and L. Zhang. "A Multiagent Data Warehousing (MADWH) and Multiagent Data Mining (MADM) Approach to Brain Modeling and NeuroFuzzy Control." Information Sciences, 167 (2004) 109-127.

[16] W. Zhang and M. Cheng. "Virtual Agents and Virtual Communities: An Agent-Oriented Software and Knowledge Engineering Paradigm for Distributed Cooperative Systems." Proc. of the 5th Int'l Conf. on Software and Knowledge Engineering, San Francisco, June 1993, 207-214.

Acknowledgement: This work has been partially supported by a grant for Faculty Development from Georgia Southern University, Statesboro, GA.



Multi-Party Sequential Pattern Mining Over Private Data

Justin Zhan1, LiWu Chang2, and Stan Matwin3

1,3School of Information Technology & Engineering, University of Ottawa, Canada
2Center for High Assurance Computer Systems, Naval Research Laboratory, USA

{[email protected], [email protected], [email protected]}

Abstract

Privacy-preserving data mining in distributed environments is an important issue in the field of data mining. In this paper, we study how to conduct sequential pattern mining, which is one of the data mining computations, on private data in the following scenario: multiple parties, each having a private data set, want to jointly conduct sequential pattern mining. Since no party wants to disclose its private data to other parties, a secure method needs to be provided to make such a computation feasible. We develop a practical solution to the above problem in this paper.

Keywords: Privacy, security, sequential pattern mining.

1 Introduction

Data mining and knowledge discovery in databases is an important research area that investigates the automatic extraction of previously unknown patterns from large amounts of data. It connects the three worlds of databases, artificial intelligence, and statistics. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if meaningful information or knowledge cannot be extracted from it. Data mining and knowledge discovery attempts to answer this need. In contrast to standard statistical methods, data mining techniques search for interesting information without demanding a priori hypotheses. As a field, it has introduced new concepts and is

becoming more and more popular with time. One of the important computations is sequential

pattern mining [1, 8, 2, 7, 3], which is concerned with inducing rules from a set of sequences of ordered items. The main computation in sequential pattern mining is to calculate the support measures of sequences by iteratively joining those subsequences whose supports exceed a given threshold. In each of the above works, an algorithm is provided to conduct such a computation, assuming that the original data are available. However, conducting such mining without knowing the original data is challenging.

Generic solutions for any kind of secure collaborative computing exist in the literature [4, 5, 6]. These solutions are the results of studies of the Secure Multi-party Computation problem [9, 5, 6, 4], which is a more general form of secure collaborative computing. However, the proposed generic solutions are usually impractical. They are not scalable and cannot handle large-scale data sets because of the prohibitive extra cost of protecting data secrecy. Therefore, practical solutions need to be developed. This need underlies the rationale for our research.

2 Mining Sequential Patterns on Private Data

2.1 Background

Data mining includes a number of different tasks. This paper studies the sequential pattern mining problem. Since its introduction in 1995 [1], sequential pattern mining has received a



great deal of attention. It is still one of the most popular pattern-discovery methods in the field of Knowledge Discovery. Sequential pattern mining provides a means for discovering meaningful sequential patterns among a large quantity of data. For example, let us consider the sales database of a bookstore. A discovered sequential pattern could be "70% of people who bought Harry Potter also bought The Lord of the Rings at a later time." The bookstore can use this information for shelf placement, promotions, etc.

In sequential pattern mining, we are given a database D of customer transactions. Each transaction consists of the following fields: customer-ID, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction-time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. A customer supports a sequence s if s is contained in the customer-sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence. Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern.
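Under these definitions, support counting can be sketched as follows, with itemsets as Python sets and a customer-sequence as a time-ordered list of itemsets (the small database below is illustrative only, not from the paper):

```python
def contains(customer_seq, pattern):
    """True if pattern (a list of itemsets) is contained in customer_seq:
    each pattern itemset is a subset of some strictly later itemset."""
    it = iter(customer_seq)          # advancing iterator enforces the ordering
    return all(any(p <= c for c in it) for p in pattern)

def support(db, pattern):
    """Fraction of customers whose sequences contain the pattern."""
    return sum(contains(seq, pattern) for seq in db) / len(db)

db = [
    [{30}, {90}],            # customer 1
    [{10, 20}, {30}],        # customer 2
    [{30}, {40, 70}, {90}],  # customer 3
]
s = support(db, [{30}, {90}])  # supported by customers 1 and 3
```

The shared iterator `it` is what makes containment order-sensitive: once an itemset of the customer-sequence has matched, only later itemsets can match the next pattern element.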

2.2 Problem Definition

We consider the scenario where multiple parties, each having a private data set (denoted by D1, D2, · · ·, and Dn respectively), want to collaboratively conduct sequential pattern mining on the union of their data sets. Because they are concerned about data privacy, no party is willing to disclose its raw data set to others. Without loss of generality, we make the following assumptions on the data sets (the assumptions can be achieved by pre-processing the data sets D1, D2, · · ·, and Dn, and such pre-processing does not require one party to send its private data set to other parties):

1. D1, D2, · · ·, and Dn are datasets owned

by party 1, party 2, · · ·, and party n, respectively, where each dataset consists of the customer-ID, transaction-time, and the items purchased in each transaction.

2. D1, D2, · · ·, and Dn contain different types of items (e.g., they come from different types of markets).

3. The identities of the transactions in D1, D2, · · ·, and Dn are the same.

4. The customer-ID and the customer's transaction time can be shared among the parties, but the items that a customer actually bought are confidential.

Mining Sequential Patterns on Private Data problem: Party 1 has a private data set D1, party 2 has a private data set D2, · · ·, and party n has a private data set Dn; the data set [D1 ∪ D2 ∪ · · · ∪ Dn] is the union of D1, D2, · · ·, and Dn (formed by vertically putting D1, D2, · · ·, and Dn together).1 Let N be a set of transactions, with Nk representing the kth transaction. These n parties want to conduct sequential pattern mining on [D1 ∪ D2 ∪ D3 · · · ∪ Dn] and find the sequential patterns with support greater than the given threshold, but they do not want to share their private data sets with each other. We say that a sequential pattern xi ≤ yj, where xi occurs before or at the same time as yj, has support s in [D1 ∪ D2 ∪ · · · ∪ Dn] if s% of the transactions in [D1 ∪ D2 · · · ∪ Dn] contain both xi and yj with xi happening before or at the same time as yj (namely, s% = Pr(xi ≤ yj)).

2.3 Sequential Pattern Mining Procedure

The procedure of mining sequential patternscontains the following steps:

Step I: Sorting

The database [D1 ∪ D2 · · · ∪ Dn] is sorted, with

customer ID as the major key and transaction

1Vertically partitioned datasets are also called heterogeneously partitioned datasets, where different datasets contain different types of items while the customer IDs are identical for each transaction.



time as the minor key. This step implicitly converts the original transaction database into a database of customer sequences. As a result, the transactions of a customer may appear in more than one row, where each row contains a customer ID, a particular transaction time, and the items bought at that transaction time. For example, suppose that the datasets, after being sorted by their customer-ID numbers, are as shown in Fig. 1. Then, after being sorted by transaction time, the data tables of Fig. 1 become those of Fig. 2.

Step II: Mapping

Each item of a row is considered an attribute. We map each item of a row (i.e., an attribute) to an integer in increasing order and repeat for all rows. A re-occurrence of an item is mapped to the same integer. As a result, each item becomes an attribute and all attributes are binary-valued. For instance, the sequence < B, (A,C) >, indicating that transaction B occurs prior to transaction (A,C) with A and C occurring together, will be mapped to integers in the order B → 1, A → 2, C → 3, (A,C) → 4. During the mapping, the corresponding transaction time is kept. For instance, based on the sorted dataset of Fig. 2, we may construct the mapping table shown in Fig. 3. After the mapping, the mapped datasets are as shown in Fig. 4.
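The mapping step can be sketched like this; the table assigns the next integer at first occurrence and reuses it on re-occurrence (representing an itemset as a tuple is a choice of this sketch, not fixed by the paper):

```python
def map_items(rows):
    """Map each distinct item/itemset to an integer in order of first
    occurrence; re-occurrences are mapped to the same integer."""
    table, mapped = {}, []
    for row in rows:
        out = []
        for item in row:
            if item not in table:
                table[item] = len(table) + 1  # next integer in increasing order
            out.append(table[item])
        mapped.append(out)
    return table, mapped

# the order from the text: B -> 1, A -> 2, C -> 3, (A, C) -> 4
table, mapped = map_items([["B", "A", "C", ("A", "C")]])
```

The inverse of `table` is exactly what Step V (Converting) needs to restore the original item representations.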

Step III: Mining

Our mining procedure is based on the mapped dataset. The general sequential pattern mining procedure makes multiple passes over the data. In each pass, we start with a seed set of large sequences, where a large sequence refers to a sequence whose itemsets all satisfy the minimum support. We utilize the seed set to generate new potentially large sequences, called candidate sequences. We find the support for these candidate sequences during the pass over the data. At the end of each pass, we determine which of the candidate sequences are actually large. These large candidates become the seed for the next pass.

The following is the procedure for mining se-quential patterns on [D1 ∪ D2 · · · ∪ Dn].

1. L1 = large 1-sequences
2. for (k = 2; Lk−1 ≠ ∅; k++) do {
3.   Ck = apriori-generate(Lk−1)
4.   for all candidates c ∈ Ck do {
5.     compute c.count
       (Section 2.4 will show how to compute this count on private data)
6.     Lk = Lk ∪ {c | c.count ≥ minsup}
7.   end
8. end
9. return ∪k Lk

where Lk stands for the set of large sequences with k itemsets and Ck stands for the collection of candidate k-sequences. The procedure apriori-generate is described as follows:

First, join Lk−1 with Lk−1:

1. insert into Ck
2. select p.litemset1, · · ·, p.litemsetk−1, q.litemsetk−1 where p.litemset1 = q.litemset1, · · ·, p.litemsetk−2 = q.litemsetk−2
3. from Lk−1 p, Lk−1 q.

Second, delete all sequences c ∈ Ck such that some (k−1)-subsequence of c is not in Lk−1.
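The join and prune steps above can be sketched in Python. Sequences are tuples of mapped itemset IDs; treating each itemset as an atomic symbol is a simplification of this sketch:

```python
from itertools import combinations

def apriori_generate(large_prev):
    """apriori-generate: join L_{k-1} with itself on the first k-2 itemsets,
    then prune candidates having a (k-1)-subsequence not in L_{k-1}."""
    prev = set(large_prev)
    k_minus_1 = len(large_prev[0])
    # join step: p and q agree on everything but their last itemset
    joined = {p + (q[-1],)
              for p in prev for q in prev
              if p != q and p[:-1] == q[:-1]}
    # prune step: every (k-1)-subsequence of a candidate must be large
    return sorted(c for c in joined
                  if all(sub in prev for sub in combinations(c, k_minus_1)))

candidates = apriori_generate([(1, 2), (1, 3), (2, 3)])
```

Here the join produces both (1, 2, 3) and (1, 3, 2), but the prune step deletes (1, 3, 2) because its subsequence (3, 2) is not large, leaving only (1, 2, 3).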

Step IV: Maximization

Having found the set of all large sequences S, weprovide the following procedure to find the maxi-mal sequences.

1. for (k = m; k > 1; k--) do
2.   for each k-sequence sk do
3.     delete all subsequences of sk from S
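Step IV can be sketched as follows, again with itemsets treated as atomic symbols so that "subsequence" means an order-preserving deletion of itemsets (a simplification of this sketch):

```python
from itertools import combinations

def maximal_sequences(large):
    """Scan from the longest large sequences down, deleting every proper
    subsequence of each surviving sequence."""
    result = set(large)
    for s in sorted(large, key=len, reverse=True):
        if s not in result:   # already deleted as a subsequence of a longer one
            continue
        for n in range(1, len(s)):
            result -= set(combinations(s, n))  # all n-subsequences of s
    return sorted(result, key=len, reverse=True)

maximal = maximal_sequences([(1,), (2,), (3,), (1, 2), (1, 3), (1, 2, 3)])
```

Only sequences not contained in any longer large sequence survive, which is exactly the definition of a sequential pattern given in Section 2.1.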

Step V: Converting

The items in the final large sequences are converted back to the original item representations used before the mapping step. For example, if 1A belongs to some large sequential pattern, then 1A will be converted to item 30, according to the mapping table, in the final large sequential patterns.



Alice                          Bob                           Carol
C-ID  T-time    Items Bought   C-ID  T-time    Items Bought  C-ID  T-time    Items Bought
1     06/25/03  30             1     06/30/03  90            1     06/28/03  110
2     06/10/03  10, 20         2     06/15/03  40, 60        2     06/13/03  107
3     06/25/03  30             3     06/10/03  45, 70        3     06/26/03  105, 106
2     06/20/03  9, 15          3     06/18/03  35, 50        3     06/19/03  103
3     06/30/03  5, 10                                        3     06/21/03  101, 102

Figure 1. Raw Data Sorted By Customer ID

2.4 How to compute c.count

To compute c.count, in other words the support for some candidate pattern (e.g., P(xi ∩ yi ∩ zi | xi ≥ yi ≥ zi)), we need to conduct two steps: one is to deal with the condition part, where zi occurs before yi and both of them occur before xi; the other is to compute the actual counts for this sequential pattern.

If all the candidates belong to one party, then c.count, which refers to the frequency count for the candidates, can be computed by this party, since it has all the information needed to compute it. However, if the candidates belong to different parties, it is a non-trivial task to conduct the joint frequency counting while protecting the security of the data. We provide the following steps to conduct this cross-party computation.

2.4.1 Vector Construction

The parties construct vectors for their own attributes (mapped-IDs). Each vector constructed from the mapped dataset has two components: one consists of the binary values (called the value-vector); the other consists of the transaction times (called the transaction-time-vector). Suppose we want to compute c.count for 2A ≥ 2B ≥ 6C in Fig. 4. We construct three vectors, 2A, 2B, and 6C, as depicted in Fig. 5.

2.4.2 Transaction time comparison

To compare the transaction times, each time-vector entry should have a value. We let all the parties randomly generate transaction times for the entries of the vector whose values are 0's. They then transform the values in the time-vector into real numbers such that if transaction tr1 happens earlier than transaction tr2, then the real number denoting tr1 is smaller than the number denoting tr2. For instance, "06/30/2003" and "06/18/2003" can be transformed to 363 and 361.8, respectively. The purpose of the transformation is that we can securely compare the times based on their real-number denotations. Next, we present a secure protocol that allows n parties to compare their transaction times.
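One simple order-preserving transformation (an assumption for illustration; the paper's exact 363/361.8 encoding is not specified) is to map each date to its ordinal day number:

```python
# Sketch: an order-preserving date-to-number transformation (the paper's
# exact 363/361.8 encoding is unspecified; ordinal day numbers also work).
from datetime import datetime

def to_number(mmddyy):
    return datetime.strptime(mmddyy, "%m/%d/%y").toordinal()

assert to_number("06/18/03") < to_number("06/30/03")
print(to_number("06/30/03") - to_number("06/18/03"))  # 12 days apart
```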

The goal of our privacy-preserving mining system is to disclose no private data in any step. We first select a key generator who produces the encryption and decryption key pair. The computation of the whole system is carried out under encryption. For the purpose of illustration, let us assume that Pn is the key generator who generates a homomorphic encryption key pair (e, d). Next, we show how to conduct each step.

2.4.3 The Comparison of Transaction Time

Without loss of generality, assume there are k transaction times: e(g1), e(g2), · · ·, e(gk), each corresponding to a transaction of a particular party.

Protocol 1.

1. Pn−1 computes e(gi) × e(gj)^(−1) = e(gi − gj) for all i, j ∈ [1, k], i > j, and sends the sequence, denoted by ϕ, to Pn in a random order.

2. Pn decrypts each element in the sequence ϕ.



C-ID  T-time    Alice     Bob       Carol
1     06/25/03  30        N/A       N/A
1     06/28/03  N/A       N/A       110
1     06/30/03  N/A       90        N/A
2     06/10/03  10, 20    N/A       N/A
2     06/13/03  N/A       N/A       107
2     06/15/03  N/A       40, 60    N/A
2     06/20/03  9, 15     N/A       N/A
3     06/10/03  N/A       45, 70    N/A
3     06/18/03  N/A       35, 50    N/A
3     06/19/03  N/A       N/A       103
3     06/21/03  N/A       N/A       101, 102
3     06/25/03  30        N/A       N/A
3     06/26/03  N/A       N/A       105, 106
3     06/30/03  5, 10     N/A       N/A

N/A: The information is not available.

Figure 2. Raw Data Sorted By Customer ID and Transaction Time

Alice:  30 - 1A   10 - 2A   20 - 3A   (10, 20) - 4A   9 - 5A
        15 - 6A   (9, 15) - 7A   5 - 8A   (5, 10) - 9A

Bob:    90 - 1B   40 - 2B   60 - 3B   (40, 60) - 4B   35 - 5B
        50 - 6B   (35, 50) - 7B   45 - 8B   70 - 9B   (45, 70) - 10B

Carol:  110 - 1C   107 - 2C   103 - 3C   101 - 4C   102 - 5C
        (101, 102) - 6C   105 - 7C   106 - 8C   (105, 106) - 9C

Note that, in Alice's dataset, items 30 and 10 reoccur, so each occurrence is mapped to the same mapped-ID.

Figure 3. Mapping Table

He assigns the element +1 if the result of decryption is not less than 0, and −1 otherwise. Finally, he obtains a +1/−1 sequence denoted by ϕ′.

3. Pn sends the +1/−1 sequence ϕ′ to Pn−1.

4. Pn−1 compares the transaction times of each entry of vectors such as 2A, 2B, and 6C in our example. She makes a temporary vector T. If the transaction times do not satisfy the requirement of 2A ≥ 2B ≥ 6C, she sets the corresponding entries of T to 0's; otherwise, she copies the original values in 6C to T (Fig. 5).

Theorem 1. (Correctness). Protocol 1 correctly sorts the transaction times.

Proof. Pn−1 is able to remove the permutation effects from ϕ′ (the resultant sequence is denoted by ϕ′′) since she has the permutation function that she used to permute ϕ, so that the elements in ϕ and ϕ′′ have the same order. It means that if the qth position in sequence ϕ denotes e(gi − gj), then the qth position in sequence ϕ′′ denotes the evaluation result of gi − gj. We encode it as +1 if gi ≥ gj, and as −1 otherwise. Pn−1 has two sequences: one is ϕ, the sequence of e(gi − gj) for i, j ∈ [1, k] (i > j), and the other is ϕ′′, the sequence of +1/−1. The two sequences have the same number of elements. Pn−1 knows whether or not gi is larger than gj by checking the corresponding value in the



Figure 4 shows the mapped data: for each party, one column per mapped-ID (1A–9A for Alice, 1B–10B for Bob, 1C–9C for Carol) and one row per customer ID; each cell holds a binary value (1 if the customer bought the mapped item(s), 0 otherwise) together with the corresponding transaction time, or N/A when no such transaction exists.

N/A: The information is not available.

Figure 4. Data After Being Mapped

      g1   g2   g3   · · ·   gk
g1    +1   +1   -1   · · ·   -1
g2    -1   +1   -1   · · ·   -1
g3    +1   +1   +1   · · ·   +1
· · ·
gk    +1   +1   -1   · · ·   +1

Table 1.

ϕ′′ sequence. For example, if the first element of ϕ′′ is −1, Pn−1 concludes gi < gj. Pn−1 examines the two sequences and constructs the index table (Table 1) to compute the largest element.

In Table 1, a +1 in entry ij indicates that the value of the row (e.g., gi of the ith row) is not less than the value of the column (e.g., gj of the jth column); −1, otherwise. Pn−1 sums the index values of each row and uses this number as the weight of the value in that row. She then sorts the sequence according to the weights.

To make it clearer, let us illustrate with an example. Assume that: (1) there are 4 elements with g1 < g4 < g2 < g3; (2) the sequence ϕ is [e(g1 − g2), e(g1 − g3), e(g1 − g4), e(g2 − g3), e(g2 − g4), e(g3 − g4)]. The sequence ϕ′′ will be [−1, −1, −1, −1, +1, +1]. According to ϕ and ϕ′′, Pn−1 builds Table 2. From the table, Pn−1 knows g3 > g2 > g4 > g1, since g3 has the largest weight, g2 the second largest, g4 the third largest, and g1 the smallest.

      S1   S2   S3   S4   Weight
S1    +1   -1   -1   -1     -2
S2    +1   +1   -1   +1     +2
S3    +1   +1   +1   +1     +4
S4    +1   -1   -1   +1      0

Table 2.
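The weight-based ranking above can be reproduced in a few lines; in this sketch the plaintext values are used only to generate the +1/−1 signs that Pn−1 would receive from Pn:

```python
# Sketch: recover the order of the g_i from the +1/-1 comparison table.
g = {"g1": 1, "g2": 3, "g3": 4, "g4": 2}  # plaintexts with g1 < g4 < g2 < g3
names = list(g)

# Entry (i, j) is +1 when g_i >= g_j, else -1 (the diagonal counts as +1).
weights = {a: sum(1 if g[a] >= g[b] else -1 for b in names) for a in names}
order = sorted(names, key=lambda a: weights[a], reverse=True)

print(weights)  # {'g1': -2, 'g2': 2, 'g3': 4, 'g4': 0}
print(order)    # ['g3', 'g2', 'g4', 'g1']
```

The weights match the Weight column of Table 2, and sorting by weight recovers g3 > g2 > g4 > g1.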

Theorem 2. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. We need to prove it from two aspects: (1)



Figure 5 illustrates the three steps on the vectors 2A, 2B, and 6C: in Step I the transaction times are compared to build the temporary vector T; in Steps II and III the value-vectors and T are fed into the Secure Number Product Protocol to obtain c.count.

Figure 5. A Protocol To Compute c.count

Pn−1 doesn't get the transaction time (e.g., gi) for each vector. What Pn−1 gets are e(gi − gj) for all i, j ∈ [1, k], i > j, and the +1/−1 sequence. From e(gi − gj), Pn−1 cannot learn each transaction time since it is encrypted. From the +1/−1 sequence, Pn−1 can only know whether or not gi is greater than gj. (2) Pn doesn't obtain the transaction time for each vector either. Since the sequence of e(gi − gj) is randomized before being sent to Pn, who can only learn the sequence of gi − gj, he can't get each individual transaction time. Thus the private data are not revealed.

Theorem 3. (Efficiency). Protocol 1 is efficient from both the computation and the communication points of view.

Proof. The total communication cost is upper bounded by αm². The total computation cost is upper bounded by m² + m + 1. Therefore, the protocol is very fast.

After the above step, the parties need to compute c.count based on their value-vectors. For example, to obtain c.count for 2A ≥ 2B ≥ 6C in Fig. 5, they need to compute

∑_{i=1}^{N} 2A[i] · 2B[i] · T[i] = ∑_{i=1}^{3} 2A[i] · 2B[i] · T[i] = 0,

where N is the total number of values in each vector. In general, let us assume the value-vectors for P1, · · ·, Pn are x1, · · ·, xn, respectively. Note that P1's vector is T; for the purpose of illustration, we denote T by xn−1. Next, we show how the n parties compute this count without revealing their private data to each other.
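In the clear, this count is just a sum of element-wise products; the protocol below performs the same sum under encryption. The value-vectors used here are hypothetical:

```python
# Sketch: c.count as the sum of element-wise products of binary vectors.
def c_count(a, b, t):
    return sum(x * y * z for x, y, z in zip(a, b, t))

# Hypothetical value-vectors for 2A, 2B and the temporary vector T (N = 3).
print(c_count([0, 1, 1], [0, 1, 0], [0, 1, 0]))  # 1
```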

2.4.4 The Computation of c.count

Protocol 2. Privacy-Preserving Number Product Protocol

1. Pn sends e(xn1) to P1.

2. P1 computes e(xn1)^(x11) = e(xn1x11), then sends it to P2.

3. P2 computes e(xn1x11)^(x21) = e(xn1x11x21).

4. Continue until Pn−1 obtains e(x11x21 · · · xn1).

5. Repeat all the above steps for x1i, x2i, · · ·, and xni until Pn−1 gets e(x1ix2i · · · xni) for all i ∈ [1, N].



6. Pn−1 computes e(x11x21 · · · xn1) × e(x12x22 · · · xn2) × · · · × e(x1Nx2N · · · xnN) = e(x11x21 · · · xn1 + x12x22 · · · xn2 + · · · + x1Nx2N · · · xnN) = e(c.count).
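The chain of steps can be traced with a toy additively homomorphic encoding e(x) = g^x mod p. It satisfies the two identities the protocol relies on, e(a) × e(b) = e(a + b) and e(a)^c = e(ac), but it is deterministic and therefore NOT a real cryptosystem; the three parties and their bits below are hypothetical:

```python
# Toy walk-through of Protocol 2 with the insecure encoding e(x) = g^x mod p.
p, g = 2**31 - 1, 5

def e(x):
    return pow(g, x, p)

def d(c, bound=10_000):
    """Brute-force 'decryption' (a discrete log), fine for tiny plaintexts."""
    acc = 1
    for x in range(bound):
        if acc == c:
            return x
        acc = acc * g % p
    raise ValueError("plaintext out of range")

# Hypothetical binary value-vectors of three parties (P3 = key generator).
x1, x2, x3 = [1, 0, 1], [1, 1, 1], [1, 0, 0]

total = 1
for i in range(3):
    c = e(x3[i])            # step 1: Pn sends the encrypted bit
    c = pow(c, x1[i], p)    # step 2: P1 raises it to its own bit
    c = pow(c, x2[i], p)    # step 3: P2 does the same
    total = total * c % p   # step 6: multiply the per-record encryptions

print(d(total))  # 1, i.e. the number of records where all three bits are 1
```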

Theorem 4. (Correctness). Protocol 2 correctly computes c.count.

Proof. In step 1, P1 obtains e(xn1); in step 2 he computes e(xn1x11). In step 3, P2 computes e(xn1x11x21). Finally, in step 5, Pn−1 gets e(x1ix2i · · · xni) for all i. He then computes e(x11x21 · · · xn1) × e(x12x22 · · · xn2) × · · · × e(x1Nx2N · · · xnN) = e(x11x21 · · · xn1 + x12x22 · · · xn2 + · · · + x1Nx2N · · · xnN), which is the encryption of c.count.

Theorem 5. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. In protocol 2, all the data transmissions are hidden under encryption. The parties who are not the key generator can't see the other parties' private data. On the other hand, the key generator doesn't obtain the encryptions of the other parties' private data. Therefore, protocol 2 discloses no private data.

Theorem 6. (Efficiency). The computation of c.count is efficient from both the computation and the communication points of view.

Proof. To prove the efficiency, we conduct a complexity analysis of the protocol. The bit-wise communication cost of protocol 2 is α(n − 1)N, and its total computation cost is upper bounded by nN. Therefore, the protocol is sufficiently fast.

3 Overall Discussion

Our privacy-preserving classification systemcontains several components. In Section 2.4.3, weshow how to correctly compare the transaction

time. In Section 2.4.4, we present protocols tocompute c.count. We discussed the correctness ofthe computation in each section.

As for the privacy protection, all the communications between the parties are encrypted; therefore, the parties who hold no decryption key cannot gain anything from the communication. On the other hand, there is some communication between the key generator and the other parties. Although these communications are still encrypted, the key generator might gain some useful information. However, we guarantee that the key generator cannot obtain the private data, by adding random numbers to the original encrypted data, so that even if the key generator gets the intermediate results, there is little possibility that he can learn them. Therefore, the private data are securely protected with overwhelming probability.

In summary, we provide a novel solution for sequential pattern mining over vertically partitioned private data. Instead of using data transformation, we define a protocol using homomorphic encryption to exchange the data while keeping them private. Our mining system is quite efficient, as can be seen from its communication and computation complexity: the total communication complexity is upper bounded by α(nN + m² − N), and the computation complexity is upper bounded by m² + m + 5t + 4.

References

[1] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee S. P. Chen, editors, Eleventh International Conference on Data Engineering, pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.

[2] Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick. Sequential pattern mining using a bitmap representation.

[3] G. Chirn. Pattern discovery in sequence databases: Algorithms and applications to DNA/protein classification. PhD thesis, Department of Computer and Information Science, New Jersey Institute of Technology, 1996.

[4] O. Goldreich. Secure multi-party computation (working draft). http://www.wisdom.weizmann.ac.il/home/oded/public_html/foc.html, 1998.

[5] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 218–229, 1987.

[6] S. Goldwasser. Multi-party computations: Past and present. In Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, CA, USA, August 21–24, 1997.

[7] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining of consensus sequential patterns. Technical Report TR02-031, UNC-CH, 2002.

[8] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Peter M. G. Apers, Mokrane Bouzeghoub, and Georges Gardarin, editors, Proc. 5th Int. Conf. Extending Database Technology, EDBT, volume 1057, pages 3–17. Springer-Verlag, 1996.

[9] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.



Privacy-Preserving Decision Tree Classification Over Vertically Partitioned Data

Justin Zhan1, Stan Matwin2, and LiWu Chang3

1,2 School of Information Technology & Engineering, University of Ottawa, Canada
3 Center for High Assurance Computer Systems, Naval Research Laboratory, USA

{zhizhan, stan}@site.uottawa.ca, [email protected]

Abstract

Protection of privacy is one of the important problems in data mining. Parties' unwillingness to share their data frequently results in the failure of collaborative data mining. This paper studies how to build a decision tree classifier under the following scenario: a database is vertically partitioned into multiple pieces, with each piece owned by a particular party. All the parties want to build a decision tree classifier based on such a database, but, due to the privacy constraints, none of them wants to disclose its private piece. We build a privacy-preserving system, including a set of secure protocols, that allows the parties to construct such a classifier. We guarantee that the private data are securely protected.

Keywords: Privacy, decision tree, classification.

1 Introduction

Business success often relies on collaboration. Collaboration is even more critical in the modern business world, not only because of the mutual benefit it brings but also because a coalition of multiple partners is more competitive than each individual. Assuming the parties trust each other to a degree that they can share their private data, collaboration becomes straightforward. However, in many scenarios, sharing data is impossible because of privacy concerns. Thus, collaboration without sharing private data becomes extremely important.

In this paper, we study a prevalent collaboration scenario involving a data mining task: multiple parties, each having a private data set, want to conduct data mining on the joint data set that is the union of all the individual data sets; however, because of the privacy constraints, no party wants to disclose its private data set to the others. The objective of this paper is to develop efficient methods that enable this type of computation while minimizing the amount of private information that each party has to disclose.

Data mining includes various algorithms such as classification, association rule mining, and clustering. In this paper, we focus on classification. There are two types of classification between collaborative parties: Figure 1(a) shows data classification on horizontally partitioned data, and Figure 1(b) shows data classification on vertically partitioned data. To use the existing data classification algorithms on horizontally partitioned data, all the parties need to exchange some information, but they do not necessarily need to exchange every single record of their data sets. However, for vertically partitioned data, the situation is different. A direct use of the existing data classification algorithms requires one party to collect all the other parties' data to conduct the computation. In situations where the data records contain private information, such a practice is infeasible.

We study classification on vertically partitioned data; in particular, we study how to build a decision tree classifier on private data. In this



Figure 1 contrasts (a) horizontally partitioned data, where each party holds some of the records, with (b) vertically partitioned data, where each party holds some of the attributes of every record.

Figure 1.

problem, each single record is divided into multiple pieces, with each party knowing one piece. We have developed a privacy-preserving system that allows the parties to build a decision tree classifier based on their joint data.

2 Privacy-Preserving Decision-Tree Classification

Classification is an important problem in the field of data mining. In classification, we are given a set of example records, called the training data set, with each record consisting of several attributes. One of the categorical attributes, called the class label, indicates the class to which each record belongs. The objective of classification is to use the training data set to build a model of the class label such that it can be used to classify new data whose class labels are unknown.

Many types of models have been built for classification, such as neural networks, statistical models, genetic models, and decision tree models. Decision tree models are found to be the most useful in the domain of data mining since they achieve reasonable accuracy and are relatively inexpensive to compute. We define our problem as follows:

Problem 1. We consider the scenario where n parties, each having a private data set (denoted by S1, S2, · · ·, and Sn, respectively), want to collaboratively conduct decision tree classification on the union of their data sets. The data sets are assumed to be vertically partitioned. Because the parties are concerned about their data privacy, no party is willing to disclose its raw data set to the others.

Next, we give the notations that we will follow.

2.1 Notations

• e: the public key.

• d: the private key.

• Pi: the ith party.

• n: the total number of parties (assumed n > 2).

• m: the total number of classes.

• xij: the jth element of Pi's private attribute.

• α: the number of bits of each transmitted element in the privacy-preserving protocols.

• N: the total number of records.

2.2 Decision Tree Classification Algorithm

Classification is one of the forms of data analysis that can be used to extract models describing important data classes or to predict future data. It has been studied extensively by the machine learning, expert systems, and statistics communities as a possible solution to knowledge discovery problems.

The decision tree is one of the classification methods. A decision tree is a class discriminator that recursively partitions the training set until each partition entirely or dominantly consists of examples from one class. A well-known algorithm



for building decision tree classifiers is ID3 [13]. We describe the algorithm below, where S represents the training samples and AL represents the attribute list:

ID3(S, AL)

1. Create a node V.

2. If S consists of samples all of the same class C, then return V as a leaf node labelled with class C.

3. If AL is empty, then return V as a leaf node labelled with the majority class in S.

4. Select the test attribute (TA) among those in AL with the highest information gain.

5. Label node V with TA.

6. For each known value ai of TA:

(a) Grow a branch from node V for the condition TA = ai.

(b) Let si be the set of samples in S for which TA = ai.

(c) If si is empty, then attach a leaf labelled with the majority class in S.

(d) Else attach the node returned by ID3(si, AL − TA).

According to the ID3 algorithm, each non-leaf node of the tree contains a splitting point, and the main task in building a decision tree is to identify an attribute for the splitting point based on the information gain. Information gain can be computed using entropy. In the following, we assume there are m classes in the whole training data set. Entropy(S) is defined as follows:

Entropy(S) = − ∑_{j=1}^{m} Qj log Qj,   (1)

where Qj is the relative frequency of class j in S. Based on the entropy, we can compute the information gain for any candidate attribute A if it is used to partition S:

Gain(S, A) = Entropy(S) − ∑_{v∈A} (|Sv|/|S|) Entropy(Sv),   (2)

where v represents any possible value of attribute A; Sv is the subset of S for which attribute A has value v; |Sv| is the number of elements in Sv; and |S| is the number of elements in S. To find the best split for a tree node, we compute the information gain for each attribute and then use the attribute with the largest information gain to split the node.
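The recursion and the two formulas above can be sketched as follows (single party, in the clear; the toy records, attribute names, and the "cls" label are made up for illustration):

```python
# Sketch of ID3 with the entropy/gain formulas above (not the paper's
# multi-party protocol; everything runs in the clear on made-up data).
import math
from collections import Counter

def entropy(rows, label="cls"):
    n = len(rows)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(r[label] for r in rows).values())

def gain(rows, attr, label="cls"):
    n = len(rows)
    rem = sum(len(sv) / n * entropy(sv, label)
              for v in {r[attr] for r in rows}
              for sv in [[r for r in rows if r[attr] == v]])
    return entropy(rows, label) - rem

def id3(rows, attrs, label="cls"):
    classes = {r[label] for r in rows}
    if len(classes) == 1:                       # step 2: pure node
        return classes.pop()
    if not attrs:                               # step 3: majority class
        return Counter(r[label] for r in rows).most_common(1)[0][0]
    ta = max(attrs, key=lambda a: gain(rows, a, label))   # step 4
    node = {}
    for v in {r[ta] for r in rows}:             # step 6: one branch per value
        sv = [r for r in rows if r[ta] == v]
        node[(ta, v)] = id3(sv, [a for a in attrs if a != ta], label)
    return node

data = [
    {"outlook": "sunny", "windy": "no",  "cls": "play"},
    {"outlook": "sunny", "windy": "yes", "cls": "stay"},
    {"outlook": "rain",  "windy": "no",  "cls": "stay"},
    {"outlook": "rain",  "windy": "yes", "cls": "stay"},
]
tree = id3(data, ["outlook", "windy"])
```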

2.3 Cryptography Tools

In this paper, we use the concept of homomorphic encryption, which was originally proposed in [18]. Since then, many such systems have been proposed [3, 15, 16, 17]. We observe that some homomorphic encryption schemes, such as [4], are not robust against chosen-cleartext attacks. We therefore base our secure protocols on [17], which is semantically secure [9].

In our secure protocols, we use the additive homomorphism offered by [17]. In particular, we utilize the following characteristic of homomorphic encryption functions: e(a1) × e(a2) = e(a1 + a2), where e is an encryption function and a1 and a2 are the data to be encrypted. Because of associativity, e(a1 + a2 + · · · + an) can be computed as e(a1) × e(a2) × · · · × e(an), where e(ai) ≠ 0. That is,

d(e(a1 + a2 + · · · + an)) = d(e(a1) × e(a2) × · · · × e(an))   (3)

d(e(a1)^(a2)) = d(e(a1 · a2))   (4)
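A quick plausibility check of identities (3) and (4), using the insecure encoding e(x) = g^x mod p as a stand-in for a real additively homomorphic scheme (it satisfies both identities but offers no security; the modulus and base are made up):

```python
# Toy check of the homomorphic identities used by the protocols.
p, g = 2**31 - 1, 5          # a prime modulus and a small base (made up)
e = lambda x: pow(g, x, p)   # insecure stand-in for the encryption e(.)

assert e(2) * e(3) % p == e(2 + 3)   # e(a1) x e(a2) = e(a1 + a2)
assert pow(e(4), 6, p) == e(4 * 6)   # e(a1)^a2 = e(a1 * a2)
```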

2.4 Privacy-Preserving Decision Tree Classification System

The privacy-preserving classification system contains several secure protocols that the parties need to follow. There are five critical steps:

• To compute Entropy(Sv).

• To compute |Sv|/|S|.

• To compute (|Sv|/|S|) Entropy(Sv).

• To compute the information gain for each candidate attribute.

• To compute the attribute with the largest information gain.



The goal of our privacy-preserving classification system is to disclose no private data in any step. We first select a key generator who produces the encryption and decryption key pair. The computation of the whole system is carried out under encryption. For the purpose of illustration, let us assume that Pn is the key generator who generates a homomorphic encryption key pair (e, d). Next, we show how to conduct each step.

2.4.1 Computation of e(Entropy(Sv))

Protocol 1. To compute e(Qj)

1. Pn sends e(xn1) to P1.

2. P1 computes e(xn1)^(x11) = e(xn1x11), then sends it to P2.

3. P2 computes e(xn1x11)^(x21) = e(xn1x11x21).

4. Continue until Pn−1 obtains e(x11x21 · · ·xn1).

5. Repeat all the above steps for x1i, x2i, · · ·,and xni until Pn−1 gets e(x1ix2i · · ·xni) forall i ∈ [1, N ].

6. Pn−1 computes e(x11x21 · · ·xn1) ×e(x12x22 · · ·xn2)× · · · × e(x1Nx2N · · ·xnN ) =e(x11x21 · · ·xn1 + x12x22 · · ·xn2 + · · · +x1Nx2N · · ·xnN ).

7. Pn−1 computes e(x11x21 · · · xn1 + x12x22 · · · xn2 + · · · + x1Nx2N · · · xnN)^(1/N) = e(Qj).
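Viewed in the clear, what Protocol 1 encrypts is just a relative frequency: Qj is the fraction of records whose per-party indicator bits for class j are all 1. A sketch with hypothetical bits for three parties:

```python
# Sketch, in the clear, of the quantity Protocol 1 encrypts as e(Q_j).
parties = [          # hypothetical indicator bits, one row per party
    [1, 0, 1, 1],    # P1
    [1, 1, 1, 0],    # P2
    [1, 0, 1, 1],    # P3 (the key generator)
]
N = 4
count = sum(all(bits[i] for bits in parties) for i in range(N))
Qj = count / N
print(Qj)  # 0.5
```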

Protocol 2. To compute e(Qjlog(Qj))

1. Pn−1 generates a set of random numbers r1,r2, · · ·, and rt.

2. Pn−1 sends the sequence of e(Qj), e(r1),e(r2), · · ·, e(rt) to Pn in a random order.

3. Pn decrypts each element in the sequence, andsends log(Qj), log(r1), log(r2), · · ·, log(rt) toP1 in the same order as Pn−1 did.

4. P1 adds a random number R to each of theelements, then sends them to Pn−1.

5. Pn−1 obtains log(Qj) + R and computes e(Qj)^(log(Qj)+R) = e(Qj log(Qj) + RQj).

6. Pn−1 sends e(Qj) to P1.

7. P1 computes e(Qj)^(−R) = e(−RQj) and sends it to Pn−1.

8. Pn−1 computes e(Qjlog(Qj) + RQj) ×e(−RQj) = e(Qjlog(Qj)).

Protocol 3. To compute e(Entropy(Sv))

1. Repeat protocols 1–2 to compute e(Qj log(Qj)) for all j's.

2. Pn−1 computes e(Entropy(Sv)) = ∏_j e(Qj log(Qj)) = e(∑_j Qj log(Qj)).

Theorem 1. (Correctness). Protocols 1–3 correctly compute the entropy.

Proof. In protocol 1, Pn−1 obtains e(Qj). In protocol 2, Pn−1 gets e(Qj log(Qj)). These two protocols are used repeatedly until Pn−1 obtains e(Qj log(Qj)) for all j's. In protocol 3, Pn−1 computes the entropy from all the terms previously obtained. Notice that although we use Entropy(Sv) for illustration, Entropy(S) can be computed following the above protocols with different input attributes.

Theorem 2. (Privacy-Preserving). Assuming the parties follow the protocols, the private data are securely protected.

Proof. In protocol 1, all the data transmissions are hidden under encryption. The parties who are not the key generator can't see the other parties' private data. On the other hand, the key generator doesn't obtain the encryptions of the other parties' private data. Therefore, protocol 1 discloses no private data. In protocol 2, although Pn−1 sends e(Qj) to Pn, Qj is hidden among a set of random numbers known only to Pn−1. Thus the private data are not revealed. In protocol 3, the computations are still under encryption, so no private data are disclosed either.

Theorem 3. (Efficiency). The computation of the entropy is efficient from both the computation and the communication points of view.



Proof. To prove the efficiency, we conduct a complexity analysis of the protocols. The bit-wise communication cost of protocol 1 is α(n − 1)N, and of protocol 2 is α(3t + 5). The total communication cost has the upper bound αm(nN + 3t − N + 5). The computation cost of protocol 1 is nN, and of protocol 2 is 5t + 3. The total computation cost is upper bounded by mnN + 5mt + 4m. Therefore, the protocols are sufficiently fast.

2.4.2 The Computation of (|Sv|/|S|) Entropy(Sv)

Protocol 4. To Compute |Sv|/|S|

1. Pn−1 sends e(|Sv|) to the party (e.g., Pi) who holds the parent node.

2. Pi computes e(|Sv|)^(1/|S|) = e(|Sv|/|S|), then sends it to Pn−1.

Up to now, Pn−1 has obtained e(|Sv|/|S|) and e(Entropy(Sv)). Next, we discuss how to compute (|Sv|/|S|) Entropy(Sv).

Protocol 5. To Compute (|Sv|/|S|) Entropy(Sv)

1. Pn−1 sends e(|Sv|/|S|) to P1.

2. P1 computes e(|Sv|/|S|) × e(R′) = e(|Sv|/|S| + R′), where R′ is a random number known only to P1, then sends e(|Sv|/|S| + R′) to Pn.

3. Pn decrypts it and sends |Sv|/|S| + R′ to Pn−1.

4. Pn−1 computes e(Entropy(Sv))^(|Sv|/|S| + R′) = e((|Sv|/|S|) Entropy(Sv) + R′ Entropy(Sv)).

5. Pn−1 sends e(Entropy(Sv)) to P1.

6. P1 computes e(Entropy(Sv))^(−R′) = e(−R′ Entropy(Sv)), and sends it to Pn−1.

7. Pn−1 computes e((|Sv|/|S|) Entropy(Sv) + R′ Entropy(Sv)) × e(−R′ Entropy(Sv)) = e((|Sv|/|S|) Entropy(Sv)).
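The blinding in steps 2–7 works because the random R′ cancels exactly. In the clear, with hypothetical numbers:

```python
# Sketch of the blinding in Protocol 5, in the clear: the random R' cancels,
# leaving (|Sv|/|S|) * Entropy(Sv). All numbers here are hypothetical.
ratio, ent, R = 0.4, 0.9710, 17.0

blinded = (ratio + R) * ent    # step 4: e(Entropy(Sv))^(|Sv|/|S| + R')
correction = -R * ent          # step 6: e(Entropy(Sv))^(-R')
result = blinded + correction  # step 7: the product of the two encryptions

assert abs(result - ratio * ent) < 1e-9
```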

Theorem 4. (Correctness). Protocols 4–5 correctly compute (|Sv|/|S|) Entropy(Sv).

Proof. In protocol 4, Pn−1 obtains e(|Sv|/|S|). In protocol 5, Pn−1 gets e((|Sv|/|S|) Entropy(Sv)). The computation uses both properties of homomorphic encryption.

Theorem 5. (Privacy-Preserving). Assuming the parties follow the protocols, the private data are securely protected.

Proof. In protocol 4, all the data communications are hidden under encryption. The key generator doesn't receive any data, and the parties who are not the key generator can't see the other parties' private data. Therefore, protocol 4 discloses no private data. In protocol 5, although P1 sends e(|Sv|/|S| + R′) to Pn, |Sv|/|S| is hidden by a random number known only to P1. Thus the private data are not revealed.

Theorem 6. (Efficiency). The computation of protocols 4 and 5 is efficient from both the computation and the communication points of view.

Proof. To prove the efficiency, we conduct a complexity analysis of the protocols. The total communication cost is 7α. The total computation cost is 8. Therefore, the protocols are very efficient.

2.4.3 The Computation of the Attribute With the Largest Information Gain

Following the above protocols, we can compute e(Entropy(S)) and e((|Sv|/|S|) Entropy(Sv)). What is left is to compute the information gain for each attribute and select the attribute with the largest information gain.

Protocol 6. To Compute the Information Gain for An Attribute

1. Pn−1 computes ∏_{v∈A} e((|Sv|/|S|) Entropy(Sv)) = e(∑_{v∈A} (|Sv|/|S|) Entropy(Sv)).

2. He computes e(∑_{v∈A} (|Sv|/|S|) Entropy(Sv))^(−1) = e(−∑_{v∈A} (|Sv|/|S|) Entropy(Sv)).

3. He computes e(Gain(S, A)) = e(Entropy(S)) × e(−∑_{v∈A} (|Sv|/|S|) Entropy(Sv)).
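Steps 1–3 can be traced with a toy additively homomorphic encoding e(x) = g^x mod p (deterministic and insecure, integers only; the scaled entropy values below are made up):

```python
# Toy trace of Protocol 6 with the insecure encoding e(x) = g^x mod p.
p, g = 2**31 - 1, 5
e = lambda x: pow(g, x, p)

ent_S = 9        # integer-scaled Entropy(S), made up
terms = [2, 3]   # integer-scaled (|Sv|/|S|) Entropy(Sv) terms, made up

prod = 1
for t in terms:          # step 1: multiply the encrypted terms
    prod = prod * e(t) % p
inv = pow(prod, -1, p)   # step 2: the encrypted negation of the sum

# step 3: e(Gain(S, A)) = e(Entropy(S)) x e(-sum of terms)
assert e(ent_S) * inv % p == e(ent_S - sum(terms))
```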



Once we compute the information gain for each candidate attribute, we compute the attribute with the largest information gain. Without loss of generality, assume there are k information gains: e(g1), e(g2), · · ·, and e(gk), each corresponding to a particular attribute.

Protocol 7. To Compute the Largest InformationGain

1. Pn−1 computes e(gi) × e(gj)^(−1) = e(gi − gj) for all i, j ∈ [1, k], i > j, and sends the sequence, denoted by ϕ, to Pn in a random order.

2. Pn decrypts each element in the sequence ϕ. He assigns the element +1 if the result of decryption is not less than 0, and −1 otherwise. Finally, he obtains a +1/−1 sequence denoted by ϕ′.

3. Pn sends the +1/−1 sequence ϕ′ to Pn−1, who computes the largest element.

Theorem 7. (Correctness). Protocols 6–7 correctly compute the attribute with the largest information gain.

Proof. In protocol 6, Pn−1 obtains e(Gain(S, A)). In protocol 7, Pn−1 gets the attribute with the largest information gain. We discuss the details as follows:

Pn−1 is able to remove the permutation effects from ϕ′ (the resultant sequence is denoted by ϕ′′) since she has the permutation function that she used to permute ϕ, so that the elements in ϕ and ϕ′′ have the same order. It means that if the qth position in sequence ϕ denotes e(gi − gj), then the qth position in sequence ϕ′′ denotes the evaluation result of gi − gj. We encode it as +1 if gi ≥ gj, and as −1 otherwise. Pn−1 has two sequences: one is ϕ, the sequence of e(gi − gj) for i, j ∈ [1, k] (i > j), and the other is ϕ′′, the sequence of +1/−1. The two sequences have the same number of elements. Pn−1 knows whether or not gi is larger than gj by checking the corresponding value in the ϕ′′ sequence. For example, if the first element of ϕ′′ is −1, Pn−1 concludes gi < gj. Pn−1 examines the two sequences and constructs the index table (Table 1) to compute the largest element.

      g1   g2   g3   ···  gk
g1    +1   +1   -1   ···  -1
g2    -1   +1   -1   ···  -1
g3    +1   +1   +1   ···  +1
···   ···  ···  ···  ···  ···
gk    +1   +1   -1   ···  +1

Table 1.

      S1   S2   S3   S4   Weight
S1    +1   -1   -1   -1     -2
S2    +1   +1   -1   +1     +2
S3    +1   +1   +1   +1     +4
S4    +1   -1   -1   +1      0

Table 2.

In Table 1, a +1 in entry ij indicates that the information gain of the row (e.g., gi of the ith row) is not less than the information gain of the column (e.g., gj of the jth column); −1 otherwise. Pn−1 sums the index values of each row and uses this number as the weight of the information gain in that row. She then selects the one that corresponds to the largest weight.

To make this clearer, let us illustrate with an example. Assume that (1) there are 4 information gains with g1 < g4 < g2 < g3, and (2) the sequence ϕ is [e(g1−g2), e(g1−g3), e(g1−g4), e(g2−g3), e(g2−g4), e(g3−g4)]. The sequence ϕ′′ will then be [−1, −1, −1, −1, +1, +1]. According to ϕ and ϕ′′, Pn−1 builds Table 2. From the table, Pn−1 knows that g3 is the largest element since its weight, +4, is the largest.
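Pn−1's table construction and weight selection can be sketched as a plaintext simulation (the gain values are hypothetical stand-ins for what the +1/−1 sequence encodes; Pn−1 never sees them, only the signs):

```python
# Gains chosen so that g1 < g4 < g2 < g3, as in the example above
gains = [1, 3, 4, 2]
k = len(gains)

# Signs for all pairs (i, j), i < j -- what the +1/-1 sequence phi'' encodes
sign = {(i, j): (1 if gains[i] >= gains[j] else -1)
        for i in range(k) for j in range(i + 1, k)}

# Fill the full k x k index table: entry [i][j] = +1 iff g_i >= g_j
# (diagonal is +1; entry [j][i] is the negation of entry [i][j])
table = [[1] * k for _ in range(k)]
for (i, j), s in sign.items():
    table[i][j] = s
    table[j][i] = -s

# Row weights; the row with the largest weight holds the largest gain
weights = [sum(row) for row in table]
best = max(range(k), key=lambda i: weights[i])
print(weights, best)  # [-2, 2, 4, 0] 2  -> g3 wins, matching Table 2
```

The weights reproduce the Weight column of Table 2, and index 2 (g3) is selected.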

Theorem 8. (Privacy-Preserving). Assuming the parties follow the protocol, the private data are securely protected.

Proof. In Protocol 6, there is no data transmission. In Protocol 7, we need to prove the claim from two aspects: (1) Pn−1 does not learn the information gain (e.g., gi) of any attribute. What Pn−1 obtains are e(gi − gj) for all i, j ∈ [1, k], i > j, and the +1/−1 sequence. From e(gi − gj), Pn−1 cannot learn any information gain since it is encrypted. From the +1/−1 sequence, Pn−1 can only learn whether or not gi is greater than gj.


(2) Pn does not obtain the information gain of any attribute either. Since the sequence of e(gi − gj) is randomized before being sent to Pn, he can only learn the differences gi − gj in a random order, not which attributes they correspond to, so he cannot recover any individual information gain. Thus the private data are not revealed.

Theorem 9. (Efficiency). Protocols 6 and 7 are efficient from both the computation and the communication points of view.

Proof. The total communication cost is upper bounded by αm^2. The total computation cost is upper bounded by m^2 + m + 1. Therefore, the protocols are very fast.

3 Overall Discussion

Our privacy-preserving classification system contains several components. In Section 2.4.1, we show how to correctly compute e(Entropy(Sv)). In Section 2.4.2, we present protocols to compute (|Sv|/|S|) Entropy(Sv). In Section 2.4.3, we show how to compute the information gain for each candidate attribute and then describe how to obtain the attribute with the largest information gain. We discussed the correctness of the computation in each section; overall correctness is therefore also guaranteed.

As for privacy protection, all communications between the parties are encrypted; therefore, a party without the decryption key cannot gain anything from the communication. On the other hand, there is some communication between the key generator and the other parties. Although these communications are also encrypted, the key generator might gain some useful information. However, we guarantee that the key generator cannot obtain the private data: random numbers are added to the original encrypted data, so that even if the key generator obtains the intermediate results, there is little possibility that he can learn them. Therefore, the private data are securely protected with overwhelming probability.

4 Conclusion

Before concluding this paper, we describe the most closely related work. In early work on privacy-preserving data mining, Lindell and Pinkas [14] propose a solution to the privacy-preserving classification problem using the oblivious transfer protocol, a powerful tool developed by secure multi-party computation (SMC) research [22, 10]. Techniques based on SMC for efficiently dealing with large data sets have been addressed in [21]. Randomization approaches were first proposed by Agrawal and Srikant [2] to solve the privacy-preserving data mining problem, and researchers have since proposed more random perturbation-based techniques to tackle such problems (e.g., [5, 19, 7]). In addition to perturbation, aggregation of data values [20] provides another alternative to mask the actual data values. In [1], the authors studied the problem of computing the kth-ranked element. Dwork and Nissim [6] showed how to learn certain types of boolean functions from statistical databases in terms of a measure of probability difference with respect to probabilistic implication, where data are perturbed with noise for the release of statistics.

The problem we are studying is actually a special case of a more general problem, the Secure Multi-party Computation (SMC) problem. Briefly, an SMC problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's input and output [12]. The SMC literature is extensive, having been introduced by Yao [22] and expanded by Goldreich, Micali, and Wigderson [11] and others [8]. It has been proved that for any function, there is a secure multi-party computation solution [10]. The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which in turn depends on the size of the input. This is highly inefficient for large inputs, as in data mining. It is well accepted that for special cases of computation, special solutions should be developed for efficiency reasons.

In this paper, we provide a novel solution for decision tree classification over vertically partitioned private data. Instead of using data transformation, we define a protocol using homomorphic encryption to exchange the data while keeping it private. The efficiency of our classification system can be seen from its communication and computation complexity: the total communication complexity is upper bounded by α(mnN + 3mo − mN + m^2 + 12), and the computation complexity is upper bounded by mnN + 5mt + m^2 + 5m + 9.

References

[1] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation of the kth-ranked element. In EUROCRYPT, pp. 40-55, 2004.

[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450. ACM Press, May 2000.

[3] J. Benaloh. Dense probabilistic encryption. In Proceedings of the Workshop on Selected Areas of Cryptography, pp. 120-128, Kingston, Ontario, May 1994.

[4] J. Domingo-Ferrer. A provably secure additive and multiplicative privacy homomorphism. In Information Security Conference, pp. 471-483, 2002.

[5] W. Du and Z. Zhan. Using randomized response techniques for privacy-preserving data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.

[6] C. Dwork and K. Nissim. Privacy-preserving datamining on vertically partitioned databases. In CRYPTO 2004, pp. 528-544.

[7] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 211-222, San Diego, CA, June 9-12, 2003.

[8] M. Franklin, Z. Galil, and M. Yung. An overview of secure distributed computing. Technical Report TR CUCS-00892, Department of Computer Science, Columbia University, 1992.

[9] B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen. On secure scalar product computation for privacy-preserving data mining. In Proceedings of the 7th Annual International Conference on Information Security and Cryptology (ICISC 2004), volume 3506 of Lecture Notes in Computer Science, pp. 104-120, Seoul, Korea, December 2-3, 2004. Springer-Verlag, 2004.

[10] O. Goldreich. Secure multi-party computation (working draft). http://www.wisdom.weizmann.ac.il/home/oded/public_html/foc.html, 1998.

[11] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pp. 218-229, 1987.

[12] S. Goldwasser. Multi-party computations: Past and present. In Proceedings of the 16th Annual ACM Symposium on Principles of Distributed Computing, Santa Barbara, CA, USA, August 21-24, 1997.

[13] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.

[14] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, Lecture Notes in Computer Science, volume 1880, 2000.

[15] D. Naccache and J. Stern. A new public key cryptosystem based on higher residues. In Proceedings of the 5th ACM Conference on Computer and Communications Security, pp. 59-66, San Francisco, California, United States, 1998.

[16] T. Okamoto and S. Uchiyama. A new public-key cryptosystem as secure as factoring. In Eurocrypt '98, LNCS 1403, pp. 308-318, 1998.

[17] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - Eurocrypt '99, LNCS 1592, pp. 223-238. Springer-Verlag, 1999.


[18] R. Rivest, L. Adleman, and M. Dertouzos. On data banks and privacy homomorphisms. In Foundations of Secure Computation, eds. R. A. DeMillo et al., Academic Press, pp. 169-179, 1978.

[19] S. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

[20] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), pp. 557-570, 2002.

[21] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 639-644, Edmonton, Alberta, Canada, July 23-26, 2002.

[22] A. C. Yao. Protocols for secure computations. In Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, 1982.


Data Mining for Adaptive Web Cache Maintenance

Sujaa Rani Mohan, E.K. Park, Yijie Han
University of Missouri, Kansas City

{srmhv7 | ekpark | hanyij }@umkc.edu

Abstract

Proxy web caching is commonly implemented to decrease web access latency, internet bandwidth costs, and origin web server load. Data mining techniques such as URL (Universal Resource Locator) text mining and web-content mining improve web-cache usage but add overhead to clients and/or routers. Moreover, changing web access patterns dictate what needs to be cached in the web proxy caches. Designing configuration settings that maintain optimal performance for a proxy cache system therefore requires an adaptive cache configuration mechanism. We propose a transparent, shareable proxy caching system in which the proxy caches adapt themselves to changing web access patterns, together with an algorithm that mines client web access patterns to classify clients, reconfigure proxy caches, and assign clients to proxy caches. Four lightweight agents, in a system of proxy caches and a web cache server, ensure optimal use of computing resources and significantly increased cache performance.

1. Introduction

Internet Service Providers (ISPs) are commercial service companies that provide internet access to individuals and/or enterprises. Usually a client connects to an ISP through a modem or cable after establishing an account. Some providers only offer a basic connection to the Internet, while others provide standard services. With the number of ISPs growing daily around the world, there is plenty of competition for acquiring subscribers, and the key to getting more clients is providing cheaper subscription rates with better accessibility. Web caching, among other techniques, is commonly used to significantly reduce internet costs. In [2], the authors study various trends and techniques in web caching. The proxy caches themselves are located at the backbone routers, which switch incoming requests from clients either randomly or, more recently, based on data mining techniques [2] such as URL (Universal Resource Locator) text mining, web content mining, etc.

Consider the general scenario in which caches and clients are deployed by a standard ISP (see Figure 1). The clients of an ISP are connected through a mesh of interception routers to the backbone router. This backbone router switches each client's requests to the appropriate web cache, usually the nearest (least-cost) one. If this cache is not able to serve the request, the request is broadcast to its neighbor caches. If the neighbor caches still cannot serve the request, the request is sent to the origin web server. This saves internet bandwidth and decreases web access latency by avoiding repeated calls to the origin web server.
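The lookup chain just described (nearest cache, then neighbor caches, then origin server) can be sketched as follows; the cache contents and URLs are hypothetical:

```python
# Toy model of the request flow: caches are dicts from URL to cached body.
def fetch(url, nearest, neighbors, origin):
    if url in nearest:
        return nearest[url], "nearest-cache hit"
    for cache in neighbors:            # request broadcast to neighbor caches
        if url in cache:
            return cache[url], "neighbor-cache hit"
    body = origin[url]                 # last resort: origin web server
    nearest[url] = body                # cache it for subsequent requests
    return body, "origin fetch"

nearest = {"/a.html": "<a>"}
neighbors = [{"/b.html": "<b>"}]
origin = {"/a.html": "<a>", "/b.html": "<b>", "/c.html": "<c>"}

print(fetch("/a.html", nearest, neighbors, origin)[1])  # nearest-cache hit
print(fetch("/b.html", nearest, neighbors, origin)[1])  # neighbor-cache hit
print(fetch("/c.html", nearest, neighbors, origin)[1])  # origin fetch
```

Only the origin fetch pays the full latency and bandwidth cost, which is what the caching hierarchy is designed to avoid.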

2. Current and related work

One of the main concerns for ISPs is to find an optimal configuration for the proxy caches to best utilize the available resources. Proxy web caches are temporary stores of frequently accessed web objects at an intermediate place between the origin web server and the client, as opposed to browser caches, which reside on the client machine. Web objects can be data or image files downloaded from the origin web server. Large files can also be downloaded and stored in the cache, thereby decreasing the time to download the file from the web. The size of a proxy web cache is, however, limited. There are many efficient techniques, such as web prefetching [9] and cache replacement [11], which help in deciding what needs to be kept in a cache and what needs to be replaced. In [4], an efficient cache replacement algorithm for the integration of web caching and prefetching is proposed. A web cache server maintains the system of proxy caches and is responsible for their smooth operation. It helps configure the proxy caches based on the type of prefetching technique used, the replacement strategy followed, etc. This is a tedious process, and no single configuration policy exists to maintain optimal performance, as the performance highly depends on user web access patterns. An adaptive mechanism is hence required.

2.1. Data mining

Data mining is the analysis of data to establish relationships and identify hidden patterns that would otherwise go unnoticed. Web usage patterns have been mined for site evaluations [14]. These approaches, however, overload the caches and/or the backbone routers. Backbone routers form the major connection to the outside networks, and overloading them will noticeably slow down the network. Data mining has been extensively used to improve the overall web experience based on the web usage patterns it identifies. In [9], the authors employ an efficient pattern-based approach for prefetching web objects. In [19], data mining is used on web log data to perform predictive prefetching of URLs that a client may request in the future based on past requests.

[Figure 1. General proxy cache deployment: client machines (1…n) connect through the ISP backbone router to shareable proxy web caches (1…k) managed by a web cache server, with the origin web servers beyond.]

2.2. Caching Architecture

A common caching system comprises a cache server that manages a group of proxy caches working together to serve a number of clients. There are several caching architectures, each with its own advantages and disadvantages. For example, though hierarchical caches have configuration hassles and unnecessary delays, including issues with security, they have shorter connection times and lower bandwidth usage than distributed caching [13]. [16] discusses using cluster caches, wherein a URL is placed in any one of a cluster of caches; this also involves configuration problems. Hybrid caches are more favored as they combine the best features of the existing architectures, but they are difficult to implement and maintain [2]. Web caching also involves different strategies for prefetching URLs, replacing stale objects in the cache, etc.

Many cache servers support different types of architectures and also allow different cache replacement and prefetching strategies to be followed based on how the configurations are set. Such static configurations, however, fail to maintain consistent cache performance when overloaded or under varying users' web access patterns; web caching performance also depends on resource availability and cost, which these approaches did not consider. There have also been approaches that dynamically change the way caching is performed. In [7], a web caching agent acquires knowledge about web objects to deduce effective caching strategies dynamically; the cache hit ratio increased by 20% on average when their adaptive admittance algorithm was used instead of traditional replacement methods. A prototypical system was designed and developed to support data warehousing of web log data, extraction of data mining models, and simulation of web caching algorithms [3]. They used an external agent to update their data mining model dynamically whenever the performance declines, and their results show a 50-75% increase in performance compared to traditional replacement algorithms such as LRU. An association rule based approach is used to predict the web objects to retain in the cache [18]; since the algorithm runs on every web object requested, there is a substantial overhead involved.

Such approaches, however, neither make the best use of available resources nor preserve optimal cache performance. An adaptive approach that dynamically changes the policies used to manage the proxy caches is needed. Our paper proposes a framework for such an approach to personalize the configuration of proxy caches based on users' varying web access patterns, thereby achieving both optimal resource utilization and optimal cache performance. In our approach, we have loosely extended the web personalization concept to web caches to personalize what the web cache contains and how it manages the data according to clients' changing needs. In more specific terms, it regulates the use of cache resources, the cache replacement strategy, the prefetching technique used, etc., as per the clients' web access patterns. Usually proxy caches are configurable and the network administrator can specify the cache parameters; deciding the most suitable parameter options is difficult, especially if the users' web access patterns cannot be predicted. Our paper uses agents to set up the proxy cache system and maintain its cache parameters for best utilization.

The paper is organized as follows. Section 3 presents our approach, including a detailed description of the association rule based rule set definition for identifying the configuration settings necessary for optimal proxy cache performance. A brief description of the caching architecture and the multi-agent system, including an algorithm showing the collaborative working of the various agents, is also given. Section 4 gives a verification of the feasibility of our approach and a brief discussion of the results. We conclude with our ongoing work and a discussion of the scalability of this approach to different caching architectures in Section 5. In this paper, we use the terms client and user, and itemset and rule set, interchangeably.

3. Proposed Approach

Our approach studies the user web access patterns and customizes how the cache is configured for cache replacements, web object prefetching, etc. in a personalized way, so that effective network usage is balanced against the given resources. This is based on the framework suggested in [15]. Our approach has been developed with the following design considerations in mind:

a. Use an existing caching architecture so that ISPs do not need to change the existing costly deployment of proxy caches.

b. Use a data mining technique to mine for web patterns that will not overload available resources.

c. There should be no re-configurations necessary for either the clients or the routers (transparency).

A multi-agent system employs classification algorithms and data mining techniques to study the clients' web access patterns and configure caches to best serve the clients with the available resources. A performance monitor also maintains the web cache performance within acceptable limits at all times. The proposed solution uses intelligent agents and a variation of the Apriori algorithm [1] for web access pattern mining, cache maintenance, and performance maintenance.

Some basic assumptions made are given below:

a. The clients have varying web access patterns, and using a single cache configuration for all proxy caches would degrade cache performance drastically.

b. There may be other configuration rules, besides those considered here, based on which a cache is set up. Usually such rules are rigid and cannot be changed based on web access patterns.

c. The proxy caches make their web objects shareable among themselves. Also, the web cache server, which is one selected among the various proxy servers, should have all rights on the remaining proxy caches.

d. The proxy caches may be located anywhere on the network based on other parameters such as cost and location.

We have approached this problem as a pattern matching problem between client patterns and proxy cache settings.

The concept of association rules works similarly. An association rule is an implication of the form A -> B, where A and B are two sets of items that occur together in many instances. In our scenario, A corresponds to cache configuration settings and B corresponds to user web access pattern items, and there is an implication A -> B if and only if a setting A allows a hit for a request of type B. Every client request goes to one of the many proxy caches and checks for the presence of a fresh copy in that cache, failing which the requested page is fetched from the origin web server.

We define a client request type based on many parameters, such as the size of the file requested, the number of requests in unit time, etc.

Cache settings are properties that define cache operations such as memory usage, the prefetching technique, the allowed upload file size, the number of DNS lookup processes allotted for this type of clientele, etc.

The Apriori algorithm [1] has become a well-known standard for identifying patterns using association rules. Its main disadvantage is that if a pattern of length n is needed, then n passes are needed through the items; this can become a large overhead for our application. In [6], the authors describe an efficient approach to perform incremental mining of frequent sequence patterns in web logs. The approach we use in this paper is a variation of the Apriori algorithm which identifies frequent rule sets using a pattern repository in linear time [12]. The main advantages of this approach are the ease of updating the rule set and scaling: new frequent rule sets added to the repository can be used immediately.

We extend this approach to identify frequent rule sets for proxy cache configuration settings (each unique proxy cache setting is obtained from a frequent rule set). The Configuration Repository contains all the frequent rule sets. A variation of their approach extended to our application is explained in detail below.

Initial set-up:

1. A list of relevant cache settings (settings irrelevant to user access patterns need not be considered) is obtained from the cache server/proxy cache system used (the SQUID cache server in our case), and the default setting options for each parameter are defined.

2. A list of tokens is defined from the proxy cache settings (e.g., 1-File Size: 100KB, 80KB, etc. gives tokens 1a, 1b, 2a, 3a, 3b, 3c). Each token has two characters XY (X uniquely identifies a setting; Y takes a value from a-z uniquely identifying a value with which that setting can be set). Each token is also associated with a support value ranging from 0 to 1. The default value that a setting takes is a.

3. All proxy caches are configured using default values for all settings, and n clients are assigned to proxy caches at random.


Once the web cache logs have about 10,000 or more entries, the initial rule set definition is done as follows:

4. The proxy caches log every client request as a hit or a miss, including information regarding the type of hit or miss and time stamps. These logs are initially collected by setting all proxy caches to the default settings and running the cache system for a period of time. From these logs we can identify the settings relevant to each client based on a simple decision tree; e.g., if client A requests web pages with high graphic content, a cache with larger memory is more suitable. From the web cache logs, n client-sets are identified. Each client-set gives a list of items that are most closely associated with that client's web access behavior. A direct association between client behavior and items can be seen in Table 1. In a single pass through these client-sets, a count of the number of times each setting-option occurred is obtained, and the setting-options are ordered in descending order of the counts.

5. An itemset is identified for each cache setting. These itemsets initially contain only one item (the item they are defined for). We use the term item to represent a single cache setting and itemset to represent a set of cache configuration settings. Items are added to an itemset based on the minimum support level, which is defined as the ratio of the number of times the item occurs in the set to the number of client-sets. Passes are made through the client-sets to identify maximum-length frequent itemsets; the number of passes will be at most the number of cache settings taken into consideration. At each pass a decision is made to either add or reject a new item; once rejected, an item can no longer be added to the itemset. (For example, item 1a's support with all other settings 2s, 3s, etc. is checked, then 1b with all other settings, etc.) Only those itemsets whose support is greater than the minimum support level are retained in the list for the next pass. At the end of the n or fewer passes we obtain the maximum-length rule sets that have the minimum support level. Default item options are added to those frequent rule sets that are not complete, i.e., that do not have a token defined.

6. The rule sets are then filtered against a list of invalid itemsets (sets of items that cannot occur together, defined by the Internet service provider at configuration time as resources change). The number of available proxy caches (MaxProxyCount) is also set by the Internet service provider. If MaxProxyCount is less than the number of frequent rule sets identified, then the MaxProxyCount rule sets with the highest support become the resulting Client Type Configurations (CTCs). This final list of CTCs is stored in a repository called the Configuration Repository. Each proxy cache is configured with respect to a CTC in the repository; more than one proxy cache may be configured with the same rule set, depending on the number of clients. As can be seen, there will be at most as many frequent rule sets as there are proxy caches.

Defining the client token set and updating the Configuration Repository:

This step is initially done to identify client user patterns and to match them to the configuration rule set best suited to them. Once the system is up and running, new rule sets are added to the Configuration Repository by running the following steps whenever performance declines beyond the threshold limit.

7. Each client's client-set is then used to identify the CTC it most closely adheres to, and clients are assigned to appropriate proxy caches based on the CTC. Now that we have a rule set for every client and valid frequent rule sets, we can easily match clients with the best suitable proxy cache. A switch list records each client and the cache it is connected to; this list is sent to the backbone router. The token sets defined by clients are also continuously monitored for any frequently occurring rule set which is not present in the Configuration Repository. The repository is updated by monitoring new frequent rule sets identified from the logs and ranking them against those existing in the repository. Only the highest-ranked frequent rule sets define the CTCs with which the proxy caches are configured.

The following section explains the actual deployment of this association rule based algorithm in the form of lightweight agents in a system of proxy caches and the master cache server.

3.1. Caching Architecture

In this paper we use a distributed caching architecture to implement our personalized caching approach. Figure 2 gives the system deployment for our approach. A brief description of the components is given below.

3.1.1. Web cache server (Master Cache). The web cache server does not cache anything; it manages the system of proxy web caches. It also contains the Configuration Repository explained above, a temporary store of pre-processed log data in the form of rule sets for each client as received from the individual proxy caches, and a client-to-proxy-cache switch list that it maintains and uses to update the routers. The classifying agent and the cache maintenance agent are deployed here. These agents manage the switch list (used by the backbone router to route client requests), which records which client's requests need to be redirected to which proxy cache. They are responsible for identifying the need for, and carrying out, the reconfiguration of proxy caches whenever the performance declines. The web cache server is located at the same level as the shareable proxy caches but does not cache any new web objects and does not have any clients attached to it.

[Figure 2. Multi-agent deployment in the proxy cache set-up: client machines (1…n) connect through the ISP backbone router to shareable proxy web caches (1…k), which host the log data pre-processing agent and the cache performance agent; the web cache server hosts the classifying agent and the cache maintenance agent, with the origin web servers beyond.]

3.1.2. Proxy Web Caches. Initially, all proxy caches are configured using default settings and client requests are randomly assigned to a proxy cache. Once the initial set-up is run and the base rule set is formed, the proxy caches are reconfigured based on the rule sets and clients are reassigned accordingly. Performance is continuously monitored, and changes to client assignment and cache configuration are made by the Master Cache server. The proxy caches managed by a Master Cache server are shareable among themselves. The log data pre-processing agent and cache performance agent are deployed at the proxy web caches and are responsible for alerting the Master Cache to any performance declines. Most of the processing occurs in the system of proxy web caches and the web cache server.

3.1.3. Clients. Clients may be a single point of access, such as a standard home computer, or a group of computers, such as a small office building or apartment. The machines themselves require neither changes in their browser settings nor additional computing resources for our approach. The routing switch takes care of any client-to-proxy-cache redirections. Hence there is no overhead at the client.

3.1.4. Web Server. This is the origin WWW server that stores web documents and manages the delivery of web pages over the internet.

Transparent inter-proxy cache cooperation

Cache communication is via simple IP multicasts. All caches (1…k) work by inter-proxy cooperation. The caches are set up at a core backbone proxy that can potentially serve all ISP clients; this is the farthest position where caches can be located for maximum benefit. Locating these caches at the focal point is not necessary, as it is too far away from the outside network and not much can be cached [5] [8]. All client requests are routed through the proxy caches, transparently to the clients. Client requests are sent to the appropriate proxies through interception at the backbone routers based on their IP addresses. The switch list of client-to-proxy-cache mappings is prepared by the classifying agent. This arrangement allows for complete transparency between clients and the working of the proxy caches. For example, if client X, configured to cache A, requests a URL cached in cache C (found through inter-proxy cache cooperation using multicasts), the data is sent to client X from cache A through the switch, transparently to client X.

3.2. Multi-Agent System

Agents perceive and act based on their environment. Our approach uses light-weight agents, which allow run-time addition of new capabilities to the system. This will help in adding further security and performance criteria checks later on. These agents help identify client web access patterns, map the patterns to cache configuration rules, allow adaptive cache re-configurations, and allow an assignment of clients to caches that gives optimal performance from the given resources at all times. These agents work in the background and do not hinder the caching protocol at any stage, even when the caches are re-configured. They are simply the core backbone for personalizing the cache usage for each client.

The proxy caches are initialized with default settings. After the initial set-up, the agents begin to work in the background. The multi-agent system comprises automatic agents, which run continuously, and semi-automatic agents, which run only when another agent triggers them. The overall working of the various agents is shown in the algorithm in Figure 3. A brief description of their features and how they work together is given below.

3.2.1. Automatic agents.
a. Log data pre-processing agent. Every proxy cache has a log of all the requests that came in and how each request was served (whether the requested web object was available in the cache: Cache_Hit, Cache_Miss, Cache_Refresh_Hit; whether the requested web object was denied though it was present in the cache, etc.). Many useful user features can be mined from these logs. The agent continuously processes the log data into the current rule set for each client and sends the current rule set list for all clients to the Master Cache on demand. This is used by the cache maintenance agent to update the Configuration Repository and the client-to-proxy-cache assignment.
b. Cache performance agent. The performance of a cache is measured by its ability to serve a client's request as efficiently as the origin web server. The cache performance agent is deployed at every proxy cache, where performance is measured for every cache cycle.

This agent triggers the classifying agent or cache maintenance agent based on the results it obtains. For every pre-determined cache cycle, performance parameters such as throughput, mean response time, error rate, queuing of requests, connection length, object size, and cache run-time hit ratio are monitored by studying the cache logs. Note that the main performance measure, the cache run-time hit ratio (%), is similar to the cache hit ratio, which measures the number of requests served from the cache as a percentage of total successful requests serviced, but it is computed only over the most recent (about 10,000) requests. This gives a more accurate estimate, since the cache content changes dynamically.
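The run-time hit ratio over only the most recent requests can be sketched with a bounded sliding window. This is an illustrative sketch: the paper fixes the window at roughly 10,000 requests, while the window size here is a parameter and the class name is hypothetical.

```python
from collections import deque

class RunTimeHitRatio:
    """Run-time hit ratio over only the most recent requests (the paper
    uses about 10,000; `window` is a parameter here). Tracking a sliding
    window rather than the whole log reflects the dynamically changing
    cache contents."""

    def __init__(self, window=10_000):
        self.recent = deque(maxlen=window)  # True = hit, False = miss

    def record(self, hit):
        self.recent.append(hit)             # oldest entry falls out at capacity

    def ratio(self):
        if not self.recent:
            return 0.0
        return 100.0 * sum(self.recent) / len(self.recent)

m = RunTimeHitRatio(window=4)
for outcome in (True, True, False, True, True):  # first request ages out
    m.record(outcome)
```

With a window of 4, only the last four outcomes (hit, miss, hit, hit) count, giving 75%.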

A steady decline in cache performance (random fluctuations in performance are ignored/filtered, as they are usually not the cause of poor cache performance) triggers the classifying agent to modify the assignment of clients to caches based on the clients' current web access patterns. If this still does not improve performance, the cache maintenance agent is called to update the Configuration Repository and to re-configure the proxy caches based on the updated rule sets.

An algorithm snippet for the cache performance agent follows:
1. For every cache cycle
2.   Check performance parameters
3.   if performance declines below threshold
4.     if time since last_call of classifying_agent < min_time_call
5.       call cache_maintenance_agent()
6.     else if number of clients for each cache is not within acceptable range
7.       call cache_maintenance_agent()
8.     else
9.       call classifying_agent()
10. End For
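The decision logic of the cache performance agent above can be made concrete as a single per-cycle step. This is a hedged sketch: the threshold, `min_time_call`, and acceptable client range are illustrative values, and the returned agent names mirror the pseudocode rather than a real API.

```python
# Sketch of one cycle of the cache performance agent's decision logic.
# All numeric defaults are assumptions for illustration only.

def performance_agent_step(hit_ratio, clients_per_cache, last_classify_time,
                           now, threshold=90.0, min_time_call=60.0,
                           client_range=(100, 2000)):
    """Return which agent to trigger for this cache cycle, or None."""
    if hit_ratio >= threshold:
        return None                        # performance acceptable
    if now - last_classify_time < min_time_call:
        return "cache_maintenance_agent"   # re-classified too recently
    lo, hi = client_range
    if any(not (lo <= c <= hi) for c in clients_per_cache):
        return "cache_maintenance_agent"   # client load badly skewed
    return "classifying_agent"             # try re-assigning clients first
```

The ordering matters: re-assignment of clients is the cheap first response, and repository maintenance is reserved for cases where re-assignment was recent or the load is out of range.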

3.2.2. Semi-automatic agents.
a. Classifying agent. This agent is triggered by the cache performance agent to check the most recent rule set list sent by the proxy caches for its clients and to re-assign the clients to different proxy caches as needed. This agent also prepares the new switch list and sends it to the backbone router to re-route future client requests.

An algorithm for the classifying agent follows:
1. for every client i = 1…n
2.   cache[i] = find best cache match from caches 1…k using binary search decision trees
3. Send new switch list cache[] to web cache server to update backbone router
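The matching step can be sketched as follows. The paper's classifying agent uses binary search decision trees; as a simpler stand-in, this sketch scores each cache's CTC token set against the client's rule set by overlap and builds the switch list that is sent to the backbone router. All data and the scoring rule are illustrative assumptions.

```python
# Illustrative stand-in for step 2 of the classifying agent. Overlap
# counting replaces the paper's binary search decision trees.

def classify(client_rule_sets, cache_ctcs):
    """client_rule_sets: client id -> set of tokens (its current rule set)
       cache_ctcs:       cache id  -> set of tokens (its CTC configuration)
       Returns the switch list: client id -> best-matching cache id."""
    def score(client_tokens, ctc_tokens):
        return len(client_tokens & ctc_tokens)   # shared configuration tokens
    switch_list = {}
    for cid, tokens in client_rule_sets.items():
        switch_list[cid] = max(cache_ctcs,
                               key=lambda k: score(tokens, cache_ctcs[k]))
    return switch_list

clients = {"x": {"1a", "8a"}, "y": {"6a", "6b"}}
caches = {"A": {"1a", "2a", "8a"}, "B": {"6a", "6b", "6c"}}
```

Here client x shares two tokens with cache A and none with B, so the switch list routes x to A and y to B.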

b. Cache maintenance agent. This agent, deployed at the Master Cache, is triggered by the cache performance agent and has access to all the proxy cache pre-processed logs containing the most recent rule sets of clients. The purpose of this agent is to update the Configuration Repository with new rule sets as needed. This is done as explained above.

Table 1. Sample items

Item  Description of unique client web access feature
1a    request greater object size
2a    how many objects the client requests on average (basically the total size of all objects cached for the client in a given period of time)
3a    number of web objects with size > 4 MB
4a    if latency time is higher for a request, i.e., loaded caches: more DNS lookup processes need to be spawned
5a    file upload sizes greater than 100 KB
6a    refresh pattern: if pages expire too fast but still need to be cached
6b    refresh pattern: if pages need to be refreshed even when they were refreshed only recently
6c    refresh pattern: if a new trend is identified
7a    clients which have a longer time interval between requests
8a    haphazard clients, which overload the cache with too many requests in a given time interval

4. Verification

In this paper we show the correctness of our approach by analyzing log traces obtained from [10] and show how the performance of a typical caching mechanism can be enhanced by dynamically varying the cache parameters using our framework. For simplicity, we simulated the scenario using only 3 configuration settings (A, B, C). Item A can take 2 values, each of which can be represented as a token (1a, 1b); item B can take 5 different settings, represented by options (2a, 2b, 2c, 2d, 2e); and item C can take 2 values. The support level for frequent items was set at 25% of the number of clients, in our case 1000. Theoretically this means one can configure the various proxy caches in 2 × 5 × 2 = 20 ways, but usually not all combinations of configuration options are valid or have minimum support. Once the CRTs have been identified and stored in the Configuration Repository, the switch list is created by matching the proxy caches to the clients. Table 1 and Table 2 show some sample items and item sets.

Figure 3. Multi-agent system algorithm

Table 2. Sample frequent rule sets

Sample rule sets*
1a, 8a
6a, 8a
1a, 2a, 6b, 7a

* Settings that are not shown retain default values

We calculated the run-time hit ratio from the cache logs, first on the raw data and then on the data processed by our approach. We count a hit if the request could be satisfied under the cache configuration decided after the algorithm is run.

Case 1: In a general scenario, all proxy caches are set up using the one rule set that has the highest support. We calculated the run-time hit ratio from cache logs from two separate weeks of a busy Internet service provider WWW server available from [10]. Using a default configuration setting for all proxy caches, we obtained an average of 88% hits over the different trace sets. We used only the requests and the types of requests as our input. We then applied our approach, computed the CRTs, and calculated the run-time hit ratio for the same traces. We ran the tests a number of times and obtained a best case of 96% and an average case of 94%.

Case 2: We then changed the users' web access patterns using a biased randomizer and found the performance degraded to as low as 66%. Applying our approach, the same requests resulted in a >90% run-time hit ratio in all test runs.

Since we assume a generous limit on resources, some margin of error is possible. We also verified the effectiveness of the agents by dynamically changing the configuration of the web proxy caches. We used a SQUID proxy cache server [17] and ran 2 proxy caches. We then simulated the configuration changes obtained from running the rule set identification algorithm on the traces and applied the changes dynamically. There was no relevant downtime for the proxy caches, and neither the run-time hit ratio nor the cache performance was significantly affected. Nevertheless, we believe that an adaptive approach is the best solution for improving web cache performance and optimally using memory and network resources.

5. Conclusion

This paper suggests a framework for dynamic cache maintenance; the results will vary for different set-ups. Also, as mentioned before, there may be network issues depending on the way the system is set up. Once set up, this framework is shown to be advantageous for both static and dynamically varying client web access patterns. It ensures that all resources are considered for best utilization with little overhead.

The major advantages of our approach are enumerated below:
a. A significant increase in cache run-time hit ratio (optimal performance depending on user web access trends)
b. No client-side re-configuration needed
c. No added overhead at the backbone routers
d. Light-weight (stationary) agents, allowing run-time addition of new capabilities to the system
e. Allows scalable web caching
f. Agents work with predictable run time
g. Though client requests are re-routed to different caches, multiple copies of the same URL are not found in different caches, because caches are shareable
h. ISPs can propose different service packs to clients. For example, they can provide high-speed internet with faster response time at higher cost (due to more Domain Name System (DNS) lookup spawns, larger caches, more bandwidth, etc.)
i. Automatic stabilization of the proxy cache system when new hardware is added, or one or more proxy caches are added or removed for maintenance, etc., with minimal manual intervention

One possible problem of our approach is that the switch list might conflict with the routing protocol, in which case the router would simply ignore the switch list. The switch list should also be configured not to overload any proxy caches.

The multi-agent system algorithm of Figure 3 is as follows:
1. Among a set of proxy caches (k+1) located at the backbone router of the ISP, choose one to be the server (Master Cache)
2. n: number of clients for the ISP (assume there is only one backbone router for this ISP)
3. Initially assign n/k clients to each proxy web cache
4. Wait for one cache cycle
5. Start log_data_pre-processing_agent to continuously obtain client rule sets
6. Start cache_performance_agent to continuously monitor cache performance
7. cache_maintenance_agent is initialized to set the default configuration for proxies 1…k
8. for every new cache cycle
9. do
10.   if cache performance is below threshold
11.     then if classifying agent last_call_time < x milliseconds
12.       call the cache_maintenance_agent
13.     else
14.       call the classifying_agent
15.   else
16.     continue

The proposed web caching framework can be adopted for almost any type of caching architecture. This approach can also be scaled to a higher level, in which the web cache servers for different sets of proxy caches talk to each other and share optimal configuration patterns. Security on the Internet has become a major concern, especially for large enterprises, which would rather not cache the web objects they access than have their caches sniffed by others. Our approach can help configure proxy caches to permit authenticated access to certain web cache objects by increasing the security level setting, thereby regulating web cache usage.

We are also extending this adaptive rule set identification approach to detect malicious transactions from the transaction logs of large databases, commonly accessed via web applications over a network. Malicious activity occurs primarily due to dynamically changing access roles and poorly configured database caches.

6. References

[1] Rakesh Agrawal, Andreas Arning, Toni Bollinger, Manish Mehta, John Shafer, Ramakrishnan Srikant, "The Quest Data Mining System", Proc. of the 2nd Int'l ACM Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996, pp. 244-249.

[2] Greg Barish and Katia Obraczka, "World Wide Web caching: trends and techniques", IEEE Communications Magazine, May 2000, 38(5):178-184.

[3] Francesco Bonchi, Fosca Giannotti, Cristian Gozzi, Giuseppe Manco, Mirco Nanni, Dino Pedreschi, C. Renso, Salvatore Ruggieri, "Web log data warehousing and mining for intelligent web caching", Data and Knowledge Engineering (DKE), Elsevier, October 2001, 32(2):165-189.

[4] Cheng-Yue Chang and Ming-Syan Chen, "A new cache replacement algorithm for the integration of web caching and prefetching", Proc. of the Eleventh International ACM Conference on Information and Knowledge Management, November 2002, pp. 632-634.

[5] Bradley M. Duska, David Marwood, and Michael J. Feeley, "The Measured Access Characteristics of World-Wide-Web Client Proxy Caches", Proc. of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, CA, Technical Report TR-97-16, December 1997.

[6] Maged El-Sayed, Carolina Ruiz, and Elke A. Rundensteiner, "FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs", Proc. of ACM WIDM'04, Washington, DC, November 2004, pp. 12-13.

[7] Annie P. Foong, Yu-Hen Hu, and Dennis M. Heisey, "Adaptive Web caching using logistic regression", Proc. of the 1999 IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing IX, August 1999, pp. 515-524.

[8] Steven D. Gribble and Eric A. Brewer, "System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace", Proc. of the 1997 USENIX Symposium on Internet Technologies and Systems, Monterey, California, USA, December 1997, pp. 207-218.

[9] Jaeeun Jeon, Gunhoon Lee, Haengrae Cho, and Byoungchul Ahn, "A prefetching Web caching method using adaptive search patterns", IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), August 2003, Vol. 1, pp. 37-40.

[10] National Laboratory for Applied Network Research, Anonymized access logs, <ftp://ftp.ircache.net/Traces/>.

[11] Stefan Podlipnig and Laszlo Böszörmenyi, "A survey of Web cache replacement strategies", ACM Computing Surveys (CSUR), December 2003, 35(4):374-398.

[12] Richard Relue and Xindong Wu, "Rule generation with the pattern repository", Proc. of the IEEE International Conference on Artificial Intelligence Systems, September 2002, pp. 186-191.

[13] Pablo Rodriguez, Christian Spanner, and Ernst W. Biersack, "Analysis of Web Caching Architectures: Hierarchical and Distributed Caching", IEEE/ACM Transactions on Networking, August 2001, 9(4):404-418.

[14] Myra Spiliopoulou, "Web usage mining for Web site evaluation", Communications of the ACM, August 2000, 43(8):127-134.

[15] Sujaa Rani Mohan, E. K. Park, Yijie Han, "Association Rule Based Data Mining Agents for Personalized Web Caching", Proc. of the 29th Annual International Computer Software and Applications Conference, IEEE, Edinburgh, Scotland, July 2005.

[16] Jia Wang, "A Survey of Web Caching Schemes for the Internet", ACM SIGCOMM Computer Communication Review, October 1999, 29(5):36-46.

[17] Duane Wessels et al., "Squid Internet Object Cache", National Laboratory for Applied Network Research, <http://squid.nlanr.net/>.

[18] Qiang Yang and Haining Henry Zhang, "Web-Log Mining for Predictive Web Caching", IEEE Transactions on Knowledge and Data Engineering, August 2003, 15(4):1050-1053.

[19] Venkata N. Padmanabhan and Jeffrey C. Mogul, "Using predictive prefetching to improve World Wide Web latency", ACM SIGCOMM Computer Communication Review, July 1996, 26(3):22-36.


Temporal Intelligence for Multi-Agent Data Mining in Wireless Sensor Networks

Sungrae Cho, Ardian Greca, Youming Li, and Wen-Ran Zhang

Department of Computer Sciences

Georgia Southern University

Statesboro, GA 30460

[email protected]

Tel: 912-486-7375; Fax: 912-486-7672

Abstract— In wireless sensor networks, sensor nodes function as autonomous, self-organizing multi-agents that provide useful information to users. Yet it remains a challenging issue how autonomous but resource-limited agents should be designed so that they are capable of helping each other in their data mining tasks. The identified limitations for sensor agents include power consumption and scalability. In this paper, we define the wireless sensor network from the perspective of multi-agent data mining and warehousing, and propose a temporal intelligent coordination protocol to reduce power consumption and to provide scalability. Simulation results show that the temporal intelligent coordination protocol significantly lowers power consumption and thus maximizes the network lifetime.

I. INTRODUCTION

Wireless sensor networks have recently drawn immense attention from industry and research institutions as an enabling technology for the invisible ubiquitous computing arena [10]. Spurred by the rapid convergence of key technologies such as digital circuitry, wireless communications, and micro-electro-mechanical systems (MEMS), a number of components in a sensor node can be integrated into a single chip with reductions in size, power consumption, and cost [1]. These small sensor nodes could be deployed in home, military, science, and industry applications such as transportation, health care, disaster recovery, warfare, security, industrial and building automation, and even space exploration. By connecting these small sensor nodes with radio links, the nodes can perform tasks that traditional sensors would be hard pressed to match.

Although the applications enabled by wireless sensor networks are very attractive, one of the most frequently used functions would be data mining. In a wireless sensor network, sensor nodes are expected to operate as agents that gather useful information for remote users. This multi-agent system, however, has to overcome several technical challenges. The identified challenges include (1) scalability, (2) adaptability, (3) addressing, and (4) energy-efficiency. Since sensor networks consist of a large number of sensor nodes, and thus a large amount of data will be produced, large-scale data mining and warehousing techniques are needed.

Also, user constraints and environmental conditions, such as ambient noise, topology change, and event arrival rate, can be time-varying in wireless sensor networks. Thus, the system should be able to adapt to these time-varying conditions. Furthermore, sensor nodes may not have global identification because of the large amount of overhead and the large number of sensor nodes. Therefore, naming or addressing is a challenging issue in wireless sensor networks.

In addition to these challenges, the energy consumption of the underlying hardware and protocols is of paramount importance. Wireless sensor nodes are expected to be operated by battery. Because of the requirement of unattended operation in remote or even potentially hostile locations, sensor networks are extremely energy-limited. Energy optimization in sensor networks is much more complex, since it involves not only reducing the energy consumption of a single sensor node but also maximizing the lifetime of the entire network. The network lifetime can be maximized by incorporating energy awareness into every stage of wireless sensor network design and operation, thus empowering the system with the ability to make dynamic tradeoffs between energy consumption, system performance, and operational fidelity [9].

Since various sensor nodes often detect common phenomena, there is likely to be some redundancy in the sensory data that the sources generate. In-network filtering and processing techniques can therefore help to conserve the scarce energy resources. Data aggregation, or data fusion, has been identified as an essential paradigm for wireless routing in sensor networks [7]. The idea is to combine the data coming from different sources en route, eliminating redundancy, minimizing the number of transmissions, and thus saving energy.
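The en-route combination idea can be sketched with a small recursive aggregation over a routing tree. This is an illustrative sketch only: the tree shape, the readings, and the choice of max as the fusion operator are assumptions for demonstration, not the paper's simulation setup.

```python
# Sketch of in-network aggregation: each agent fuses its children's
# responses with its own reading before forwarding, so redundant values
# never travel further than one hop toward the sink. Data are made up.

def aggregate_max(tree, readings, node):
    """tree: parent -> list of children; readings: node -> local value.
    Returns the single fused value that `node` forwards upward."""
    best = readings[node]
    for child in tree.get(node, []):
        best = max(best, aggregate_max(tree, readings, child))
    return best

tree = {"sink": ["a", "b"], "a": ["c", "d"]}
readings = {"sink": 61, "a": 64, "b": 71, "c": 80, "d": 59}
```

With max as the operator, node a forwards a single value (80) instead of three, which is exactly the transmission saving that data fusion provides.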

In this paper, we apply the multi-agent data mining concept to wireless sensor networks and propose a temporal intelligent coordination protocol as a knowledge and coordination mechanism. The objective of the protocol is to further reduce energy consumption when data aggregation is involved. To reduce energy consumption, our protocol employs an intelligent decision logic in the sensor agent which defers or deactivates the transmission of its response. To the best of our knowledge, this is the first research incorporating the multi-agent data mining concept into wireless sensor networks and providing a temporal intelligent coordination protocol for sensor agents.

The remainder of this paper is organized as follows. The multi-agent system methodology for wireless sensor networks is explained in Section II. In Section III, the proposed temporal intelligent coordination protocol is described. In Section IV, we compare the energy-efficiency performance of data aggregation with and without our protocol. Finally, contributions and future work are discussed in Section V.

II. MULTI-AGENT SYSTEM METHODOLOGY

A number of methods have been proposed for modeling agents in a distributed environment [4], [6]. The widely used approach is to model agents based on BDI (belief, desire, and intention) [6]. A sensor agent can be a miner, a decision maker, a controller, or an actor that has local or partial learning and decision capabilities, can manage and use its local data and knowledge, and can cooperate or be coordinated with other agents for collective monitoring, learning, and decision-making. The following sensor agent activities can be identified [13]:

• identifying sensor agents and agent communities, e.g., a sensor agent monitoring temperature,
• training new sensor agents using task assignment,
• dispatching sensor agents to their posts,
• deploying knowledge and coordination protocols,
• mining new knowledge, including new coordination protocols.

The BDI approach [6] sees the problem from two perspectives: an external and an internal view. The external view breaks the problem into two main components: the agents themselves (agent model) and their collaboration or coordination. The internal view uses three models for the agent class: an agent model for defining relationships between agents, a goal model for describing goals, and planning and scheduling models for achieving the agent's goals. In any distributed environment, agents can be classified with particular roles according to their capability descriptions [4]. Agents may have persistent roles (long-term assignments) as well as task-specific roles (short-term assignments).

From these two points of view, we can decompose the multi-sensor-agent-based organization into two main models: the agent/role model (agents' capabilities and behavior) and the agent/role interaction model. Note that the agent/role interaction model can be defined down to the level of individual query-response and the associated data. To perform appropriate responses, a role can be defined with four general attributes: responsibility, permissions, activities, and protocols [4].

• Responsibility: sensor agent/role functionality can be measured by the responsibility assigned to it, which can be divided into two categories: the timeliness property and the security property. The timeliness property ensures the task will be done by performing certain actions. To illustrate, consider the monitoring responsibility of a sensor agent/role. The timeliness property in this case is to inform the relevant agent of any updates in the data resources. In this context, an example of a sensor agent responsibility might be

DataMonitor = {Monitor.DataCollectionAgent, CheckTemperature.AwaitUpdate}.


[Figure 1 depicts phenomenon gathering: a user, connected over a wide area network, accesses a sink, which transmits interests into the sensor field and receives data back from it.]

Fig. 1. Phenomenon gathering.

This expression represents that DataMonitor consists of the execution protocol Monitor, followed by the protocol DataCollectionAgent, followed by the activity CheckTemperature and a protocol AwaitUpdate. In this case, the sensor agent will be required to ensure that the temperature satisfies a certain limitation, called its safety property, e.g., 70 ≤ temperature ≤ 76.

• Permissions: the rights associated with a role that allow it to realize its responsibility. This specification shows that the sensor agent carrying out its role has permission to access, read, and modify the data source.
• Activities: private actions/computational functionality associated with the role.
• Protocols: the mechanisms by which roles interact or communicate with each other.
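The four role attributes can be encoded as a simple data structure. This is a hypothetical encoding for illustration: the field names follow the four attributes in the text and the DataMonitor/safety example above, while everything else (the class shape, the safety bounds as a tuple) is an assumption.

```python
# Hypothetical encoding of a sensor agent role with the four general
# attributes: responsibility, permissions, activities, protocols. The
# safety property mirrors the 70 <= temperature <= 76 example above.

from dataclasses import dataclass

@dataclass
class Role:
    responsibility: list   # ordered protocols/activities to perform
    permissions: set       # rights over the data source
    activities: list       # private computational functionality
    protocols: list        # interaction mechanisms with other roles
    safety: tuple = (70, 76)

    def safe(self, value):
        """Check the safety property for a measured value."""
        lo, hi = self.safety
        return lo <= value <= hi

data_monitor = Role(
    responsibility=["Monitor", "DataCollectionAgent",
                    "CheckTemperature", "AwaitUpdate"],
    permissions={"access", "read", "modify"},
    activities=["CheckTemperature"],
    protocols=["Monitor", "AwaitUpdate"],
)
```

A temperature of 72 satisfies the safety property, while 80 violates it and would trigger the timeliness obligation to inform the relevant agent.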

In the following, we discuss how roles interact and communicate, and how power consumption is minimized, from the perspective of the agent/role interaction model.

III. MULTI-AGENT TEMPORAL INTELLIGENT COORDINATION PROTOCOL

A. Background

Sensor nodes are scattered densely in a sensor field, as in Fig. 1. A node called the sink requests sensory information by sending a query throughout the sensor field. This query is received by sensor agents (or sources). When an agent finds data matching the role (or query), the data (or response) is routed back to the sink by the multihop, infrastructureless networked sensors. The information gathered at the sink agent can be accessed by the user via existing wide area networks such as the Internet or satellite networks [1].

This role dissemination and sensory data gathering can be performed by the traditional address-centric approach, where the shortest path is found based on the physical end address, as in the IP world. The cost of an address in wireless sensor networks can be considered high if the address space is underutilized and the addresses occupy a large portion of the total bits transmitted. Globally unique addresses would need to be very large compared to the typical size of the data attached to them. Maintaining local addresses would also be inefficient, because more work is required to keep addresses locally unique as the network topology changes dynamically. In wireless sensor networks, a more favorable approach is data-centric routing. In the data-centric approach, role dissemination is performed to assign the sensing tasks to the sensor agents [1].

Data-centric routing requires attribute-based naming [1], [8]. With attribute-based naming, users are more interested in querying an attribute of the phenomenon than in querying an individual agent. For instance, "are there any agents where the temperature is over 70 degrees?" is a more common query than "what is the temperature measured by a certain agent?" Attribute-based naming is used to carry out queries by using the attributes of the phenomenon.

The coverage of deployed sensors will overlap to ensure a robust sensing task, so one event will likely trigger multiple sensors observing the same phenomenon. In this case, the sink is likely to receive multiple identical copies of a sensory datum. Also, some roles inherently produce redundant responses, as follows:

• Max: The sink is interested in gathering the maximum value from the sensor field. In this case, all values less than the maximum are redundant.
• Min: The sink is interested in gathering the minimum value from the sensor field. In this case, all values greater than the minimum are redundant.
• Existence: Some applications need to identify the existence of a target object. For example, in directed diffusion [5], an initial query dissemination is used to determine whether there indeed are any sensor agents that detect the object of interest.
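The redundancy in these role types can be made concrete: for each role, only a subset of the received responses actually needs to be forwarded. This is an illustrative sketch, not the paper's protocol; the role-type strings and the filtering function are assumptions.

```python
# For a singular role only one response is useful; the rest are redundant.
# This sketch filters the responses an aggregating agent would actually
# forward for each role type (names and behavior are illustrative).

def useful_responses(role_type, responses):
    if role_type == "max":
        return [max(responses)]        # values below the max are redundant
    if role_type == "min":
        return [min(responses)]        # values above the min are redundant
    if role_type == "existence":
        return [True] if any(responses) else []  # one positive report suffices
    return list(responses)             # non-singular roles keep all responses
```

For a max role, three readings collapse to one transmission; for an existence role, any single positive detection suffices, which is the energy saving that motivates suppressing the rest.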

We refer to a role of the above types as a singular role, which expects only one response from the source agents. Redundant and unnecessary responses generate unnecessary transmissions at the underlying layers. For example, an unnecessary response causes a high duty cycle at the medium access control (MAC) layer, which in turn generates high contention among multiple agents. Consequently, sensor agents suffer from unnecessary energy consumption.

B. Temporal Data Suppression

In this section, a temporal intelligent coordination protocol is proposed for the setting in which sensor agents collectively gather sensor data and report it to the sink agent. The objective of the scheme is to further reduce energy consumption when data aggregation is involved. To this end, our scheme employs intelligent decision logic in each sensor agent, which defers or deactivates the transmission of its response. The temporal intelligent coordination protocol operates as follows:

• The sink disseminates a role to its child agents with (1) the role type, (2) the depth of the tree D, and (3) a timer parameter T. The sink then waits DT for responses from its child agents.

• Each agent simply forwards the role and waits for (D − d)T, where d is the depth of the agent. This waiting time allows each agent to aggregate all the responses from its child agents, and agents at the same depth can be synchronized.

• When responses are received at source agent i from its child agents during (D − d)T, the agent looks up the role type. If the role type is not singular, the agent immediately sends its response back to its parent after (D − d)T. If the role type is singular, it activates a timer after (D − d)T with timer value Bi, which is derived from the received timer parameter T. When the timer expires, the source agent transmits its response. If a response from another source agent is received prior to timer expiration, agent i compares the received response with its own response. If agent i finds that its response is redundant, it deactivates its timer.
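The redundancy check in the last step can be sketched in code. The following is a minimal illustration (the class and method names are ours, not from the paper) of the suppression decision for the three singular role types defined in Section III:

```java
// Sketch of the singular-role suppression decision described above.
// Class and method names are illustrative, not part of the paper's protocol.
public class SingularRole {
    public enum Type { MAX, MIN, EXISTENCE }

    // Returns true if the agent's own response need not be sent once
    // `overheard` has been received from another source agent during
    // the timer window, i.e., the own response is redundant.
    public static boolean isRedundant(Type role, double own, double overheard) {
        switch (role) {
            case MAX:       return own <= overheard; // smaller values add nothing
            case MIN:       return own >= overheard; // larger values add nothing
            case EXISTENCE: return true;             // one detection suffices
            default:        return false;
        }
    }
}
```

For a MAX role, an agent with reading 42 that overhears a response of 70 before its timer Bi expires would deactivate its timer and stay silent.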

IV. PERFORMANCE EVALUATION

To evaluate the energy-efficiency performance of the temporal intelligent coordination protocol, we developed an event-driven simulator in Java. The simulator generates a random topology as follows. We assume that the

Fig. 2. The number of responses vs. R (curves for T = 5, 10, and 20; y-axis: expected number of responses).

sensors have a fixed radio range and are placed randomly in a square area. The sensors form a network routing tree, built from each agent's proximity metric using breadth-first search [3]. The root of the tree (the sink) is randomly selected by the simulator. When we vary the number of sensors, we vary the size of the area over which they are distributed so as to keep the sensor density constant. For instance, we use a 1000 × 1000 area for 1000 sensors; for 4000 sensors, the dimensions are enlarged to 2000 × 2000.

Based on the tree formed, the sink disseminates a role (with all role parameters as described in Section III) to its child agents, which forward the role to their children. This process continues until the role reaches the deepest agents. The depth of the tree is computed from the tree formed and is used for the response waiting time (D − d)T at each agent. When the role reaches the deepest agents, they respond with their sensor readings, which are aggregated at their parent agents, and so on toward the sink.
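The per-agent waiting time can be made concrete with a small sketch. The tree representation and names below are our illustration, not the paper's simulator code; the point is only how depth d and the waiting time (D − d)T are derived:

```java
import java.util.*;

// Sketch of computing agent depths over the routing tree and the
// response waiting time (D - d) * T used in the protocol above.
// The adjacency-list tree representation is our illustration.
public class WaitTimes {
    // Returns depth[i] = depth d of agent i; the root (sink) has d = 0.
    public static int[] bfsDepths(List<List<Integer>> children, int root, int n) {
        int[] depth = new int[n];
        Arrays.fill(depth, -1);
        Deque<Integer> queue = new ArrayDeque<>();
        depth[root] = 0;
        queue.add(root);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : children.get(u)) {
                depth[v] = depth[u] + 1;
                queue.add(v);
            }
        }
        return depth;
    }

    // Waiting time before an agent at depth d forwards its aggregate upward.
    public static int waitTime(int treeDepth, int agentDepth, int timerT) {
        return (treeDepth - agentDepth) * timerT;
    }
}
```

With a chain sink → agent 1 → agent 2 (so D = 2) and T = 5, the deepest agent waits 0 time units and responds immediately, while the sink waits DT = 10 before finalizing its aggregate.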

The sensor reading values are generated uniformly in the range [10, 90], where we assume the minimum and maximum possible sensor readings are 1 and 100, respectively. In other words, the sink expects sensor readings from 1 to 100, but the actual readings are between 10 and 90. The choice of range is rather arbitrary, but we observed that expanding the reading range does not affect the performance when we also increase the agent density.


In Fig. 2, we show the number of responses versus the number of agents R for parameters T = 5, 10, and 20 time units, when the sink sends out the MAX role. We observe that only a very small number of responses is generated for every value of T, even with a large number of source agents R (at most 101 responses at T = 5 and R = 1000). This shows that, with the temporal intelligent coordination protocol parameter T, the number of responses can be significantly reduced, i.e., energy efficiency can be significantly improved. In particular, the larger T is, the greater the improvement in energy efficiency. However, a large T increases the total latency of role processing. Therefore, the recommended choice of T would be arg max_{t < D_SINK/D} {t}, where D_SINK is the maximum allowable role latency at the sink.

V. CONCLUSIONS AND FUTURE WORK

In this paper, we have applied multi-agent data mining concepts to wireless sensor networks and proposed a temporal intelligent coordination protocol. The objective of the scheme is to further reduce energy consumption when the sink agent collects sensory information from the agents in the sensor field. To reduce energy consumption, our scheme employs intelligent decision logic in each sensor agent, which delays or deactivates the transmission of its response.

The performance evaluation shows that data aggregation with the intelligent coordination protocol significantly improves energy efficiency compared with other protocols. However, this improvement has been made at the expense of increased latency. As future work, we will investigate a mechanism to reduce the delay while providing a similar level of energy efficiency.

REFERENCES

[1] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, "A Survey on Sensor Networks," IEEE Communications Magazine, pp. 102–114, August 2002.

[2] S. Cho, "On Timing Issue in Data Aggregation for Wireless Sensor Networks," in Proc. of UWC, Tokyo, Japan, September 2005.

[3] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, Massachusetts, 1990.

[4] K. Decker, K. Sycara, and M. Williamson, "Modeling Information Agents: Advertisements, Organizational Roles, and Dynamic Behavior," in Proc. of AAAI Workshop on Agent Modeling, 1996.

[5] C. Intanagonwiwat, R. Govindan, and D. Estrin, "Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks," in Proc. of ACM MOBICOM, Boston, MA, August 2000.

[6] M. Kinny, M. Georgeff, and A. Rao, "A Methodology and Modeling Technique for Systems of BDI Agents," Technical Report, Australian Artificial Intelligence Institute, Melbourne, Australia, January 1996.

[7] B. Krishnamachari, D. Estrin, and S. Wicker, "The Impact of Data Aggregation in Wireless Sensor Networks," in Proc. of IEEE ICDCSW, Rome, Italy, July 2002.

[8] C. C. Shen, C. Srisathapornphat, and C. Jaikaeo, "Sensor Information Networking Architecture and Applications," IEEE Personal Communications, pp. 52–59, August 2001.

[9] E. Shih, S. Cho, N. Ickes, R. Min, A. Sinha, A. Wang, and A. Chandrakasan, "Physical Layer Driven Protocol and Algorithm Design for Energy-efficient Wireless Sensor Networks," in Proc. of ACM MOBICOM, Rome, Italy, July 2001.

[10] M. Weiser, "The Computer for the 21st Century," Scientific American, September 1991.

[11] W. Zhang, "Nesting, Safety, Layering, and Autonomy: A Reorganizable Multiagent Cerebellar Architecture for Intelligent Control – With Application in Legged Locomotion and Gymnastics," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 28, no. 3, pp. 357–375, 1998.

[12] W. Zhang, "A Multiagent Data Warehousing and Multiagent Data Mining Approach to Cerebrum/Cerebellum Modeling," in Proc. of SPIE Conference on Data Mining and Knowledge Discovery, Orlando, FL, April 2002.

[13] W. Zhang and L. Zhang, "A Multiagent Data Warehousing (MADWH) and Multiagent Data Mining (MADM) Approach to Brain Modeling and Neurofuzzy Control," Elsevier Information Sciences Journal, vol. 167, pp. 109–127, 2004.


A Schema of Multiagent Negative Data Mining Fuhua Jiang, Yan-Qing Zhang, and A.P. Preethy

Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994

E-mail: {cscfhjx, yzhang, ppreet}@cs.gsu.edu

Abstract—The properties of a training data set, such as its size, distribution, and number of attributes, contribute significantly to the generalization error of a learning machine. A data set that is not well distributed is prone to produce a model with partial overfitting. The approach introduced in this paper for binary classification enhances useful data information by mining negative data, implemented with multiagent techniques. Each agent is a representative that is either a traditional learning algorithm or a partitioning and combining mechanism. The natural division into positive and negative data splits a large data set into a series of data subsets, which can lead to low-cost computation in data mining. A schema of multiagent negative mining is studied.

I. INTRODUCTION

A multiagent system consists of a number of agents that interact with each other. These agents require the ability to cooperate, coordinate, and negotiate among themselves [1]. How autonomous agents can coordinate their activities is a major challenge in large-scale, data-intensive systems. Many systems, such as biological systems, weather databases, and financial data systems with huge data sets, need to be preprocessed before a knowledge base can be extracted from the data warehouse. Techniques such as pattern-based similarity search, cluster analysis, decision-tree-based classification, and association rule mining [2, 3] have been developed over the past decade to process large centralized data sets. These approaches are not tailored for multiagent data mining. In a multiagent system, each agent processes only some subsets of the data [4]. A specific example is distributed data mining, where a large data set is scattered over various sites [2]. However, an agent cannot be efficient if the data set is extremely large, because an agent is "single-minded," functioning on a single independent job. This paper focuses on the derivation of a data-set partitioning schema that divides a large data set into several small data subsets of reasonable scale. Agents that cooperate and work on the small data subsets can therefore collectively accomplish a large job. This work is similar to cluster analysis, a descriptive data mining approach that partitions a data set into homogeneous groups. The difference in negative data mining is the uneven positive and negative data sets. In addition, multiagent techniques are employed in the schema to create the data subsets.

For a specified hypothesis h ∈ H, where H is the hypothesis space, all examples in the data set in the universe space TS are divided into two primal groups: the positive and negative data sets. The positive data is the subset of all correctly classified examples, whereas the negative data is the rest. An example can be positive or negative. Negative data does not mean the data is wrong or corrupt; rather, negative data is data that the hypothesis cannot separate well. Negative data strongly depends on the hypothesis, so whether an example is positive or negative is relative. For a specific example, hypothesis A may classify it as negative while hypothesis B classifies it as positive. Furthermore, even for the same hypothesis, an example may be positive or negative depending on the parameters α of the hypothesis h(x) = f(x, α).

Negative data contains positive information. The reason negative data exists is that the current hypothesis cannot correctly classify all examples. A hypothesis is evaluated as good or not by its prediction accuracy on unseen data, so it is meaningless to call an individual example good or bad. Negative data is defined only relative to the hypothesis applied. This indicates that the negative data holds information that the current hypothesis has not yet mined. The more information is exploited, the higher the accuracy a learning machine can obtain, so mining negative data can improve the accuracy of machine learning. The Negative Data Driven Compensating Hypothesis Approach (NDDCHA), introduced below, is used to demonstrate the multiagent technique.

The rest of this paper is organized as follows. In Section II, the algorithm of NDDCHA is introduced. In Section III, architecture of multiagent schema is studied. Finally in Section IV, the main contribution of this paper is summarized.

II. INTRODUCTION TO THE NEGATIVE DATA DRIVEN COMPENSATING HYPOTHESIS APPROACH (NDDCHA)

The training data set is partitioned into three disjoint subsets: misclassified, not-well-separated, and well-separated examples. As shown in Fig. 1, the segment of the hypothesis within the rectangle needs to be repaired; the other parts of the hyper-surface classify the positive data as well-separated and have high generalization capacity. The misclassified and not-well-separated examples together are called the negative data subset, whereas the well-separated examples are called the positive data subset. A hypothesis in classification is a hyper-surface in a multidimensional space, which is the knowledge represented as the model. For example, in an SVM the hyper-surface in the input space maps to a hyperplane in the feature space through a kernel function [5, 6]. Compensating a hypothesis has the same meaning as repairing a hyper-surface. A single hyper-surface h1(x) is not enough for high prediction accuracy, because further improvement in training accuracy is prone to overfitting. To avoid overfitting, a curve of low degree h2(x) is preferred; however, h2(x) leads to low prediction accuracy. Therefore, in order to improve prediction accuracy, or generalization capacity, the hyper-surface h1(x) needs to be repaired so that it approximates the underlying function f(x).

NDDCHA uses a base hyper-surface, created by a base learning algorithm, to approximate the main outline of the underlying function. A number of patching hyper-surfaces are then overlapped onto the base hyper-surface to form a new hyper-surface; the patching information comes from the negative data set. The main problem in compensating the base hypothesis is determining which area of the hyper-surface needs to be patched. In the Boosting and Bagging algorithms [7, 8], a voting strategy is applied to combine all hypotheses. That approach does not work here, because each patching hypothesis depends on the base learning hypothesis. The patching hypothesis uses the negative data, which is the complement of the positive data set.

The training data set is denoted S, and (x, y) ∈ S is an example of S, where x is an n-dimensional vector in real space and y is the label. In binary classification, either y = +1 or y = −1. If an example (x, y) is correctly classified according to the classifier or separator h(x) = 0, then (x, y) is said to be in the consistent subset, (x, y) ∈ CS ⊆ S; otherwise (x, y) is in the inconsistent subset IS = S − CS, (x, y) ∈ IS. The partitioner p(h, x, y) returns either true or false, so that it can partition a data set; it accepts the hypothesis and a specific example as input. One simple example of a partitioner in classification is the crisp boundary p(h, x, y): ||y − h(x)|| ≤ ε, ε ∈ [0, 0.5]; the not-well-separated data is then NS = {(x, y) | (x, y) ∈ S, ||y − h(x)|| ≤ ε, ε ∈ [0, 0.5]}. The boundary between the positive and negative data sets is called the border. We can say the positive and negative data sets are divided by both the separator h(x) and the partitioner p(h, x, y). Let d(h, x, y) = [p(h(x), x, y) = true or y·h(x) ≤ 0]; then N = {(x, y) | (x, y) ∈ S, d(h, x, y)}, and d(h, x, y) is called a divider, dividing the training data set into the positive and negative data sets.
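A minimal sketch of the divider d(h, x, y) may clarify the definitions above. For illustration we use a scalar input x instead of an n-dimensional vector, and the class and method names are ours:

```java
import java.util.function.DoubleUnaryOperator;

// Sketch of the divider d(h, x, y): an example is negative if it is
// misclassified (y * h(x) <= 0) or flagged by the partitioner p.
// Scalar x is used instead of an n-dimensional vector for brevity.
public class Divider {
    // eps plays the role of the partitioner threshold; here we flag an
    // example as not well-separated when its residual |y - h(x)| exceeds
    // eps (one plausible concrete reading of the crisp partitioner).
    public static boolean isNegative(DoubleUnaryOperator h, double x, double y, double eps) {
        double hx = h.applyAsDouble(x);
        boolean misclassified = y * hx <= 0;          // wrong side of separator
        boolean notWellSeparated = Math.abs(y - hx) > eps; // large residual
        return misclassified || notWellSeparated;
    }
}
```

With h(x) = x as a toy separator and ε = 0.3, the example (x = 2.0, y = +1) is negative (residual 1.0 > ε) even though it is correctly classified, while (x = 0.9, y = +1) is positive.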

Let the training data set S_0 = S. It can be partitioned into two subsets according to a divider d(h, x, y), (x, y) ∈ S_0, where h(x) is produced by a base learning algorithm. One subset is the positive subset S#_1, the set of well-separated examples from S_0 under the hypothesis h^(0)(x) = h(x); the remainder of S_0 is the negative subset S_1, satisfying S_0 = S#_1 + S_1. Let X be the collection of input vectors from the training set S, and Y the vector of all labels of the training set: X = {x | (x, y) ∈ S} and Y = {y | (x, y) ∈ S}. Let h^(i)(x) be the patching negative models working on the training negative subsets S_i, and d^(i)(h^(i), x, y) the corresponding dividers, where S_i is the negative subset of S_{i−1} according to d^(i−1)(h^(i−1), x, y). Here h(x, i) is the comprehensive patching model, and h(x, k) is the final model provided to the testing procedure:

h(x, 0) = h^(0)(x)
h(x, i) = h(x, i−1) + h^(i)(x),  for i = 1..k

The sign + in the above expression denotes an overlap operation on two models, by which h^(i)(x) compensates h(x, i−1). Therefore the hypothesis h(x) of the NDDCHA approach is h(x, k) = Σ_{i=0}^{k} h^(i)(x), for k > 0. The training data sets

are defined as follows,

S_0 = S
S_i = {(x, Δy_i) | x ∈ X, d^(i−1)(h^(i−1), x, Δy_{i−1}), Δy_i = Δy_{i−1} − h(x, i−1), Δy_0 = y ∈ Y}
S_i = S_{i−1} − S#_{i−1},  for i = 1..k

The labels on the training subsets S_i are the differences between the predicted labels and the expected labels. The hypothesis is produced by training on the residual data, since the idea of NDDCHA is to compensate the base hypothesis at each step. Since the above algorithm is iterated k times, the patching learner has to be a regression learning algorithm. There are a total of 1 + k passes in this algorithm. S#_i is the ith positive data subset and does not change during training. The final positive data is ∪_{i=1}^{k+1} S#_i.
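The iterative training just described can be sketched as follows. This is our illustration only: the "learner" is a toy constant-fit regressor standing in for any regression algorithm, a scalar x replaces the input vector, and we treat an example as negative when its residual exceeds ε (one concrete reading of the divider):

```java
import java.util.*;
import java.util.function.DoubleUnaryOperator;

// Illustrative sketch of the NDDCHA training loop: each pass fits a patching
// model h^(i) on the residual labels of the current negative subset and adds
// it to the accumulated model h(x, i) = h(x, i-1) + h^(i)(x).
public class Nddcha {
    public interface Learner { DoubleUnaryOperator fit(double[] xs, double[] labels); }

    public static DoubleUnaryOperator train(double[] xs, double[] ys,
                                            Learner base, Learner patcher,
                                            double eps, int k) {
        DoubleUnaryOperator model = base.fit(xs, ys);     // h(x, 0) = h^(0)(x)
        List<Integer> current = new ArrayList<>();        // S_0 = S
        for (int j = 0; j < xs.length; j++) current.add(j);
        for (int pass = 0; pass < k; pass++) {
            final DoubleUnaryOperator m = model;
            List<Integer> neg = new ArrayList<>();        // S_i, a subset of S_{i-1}
            for (int j : current)
                if (Math.abs(ys[j] - m.applyAsDouble(xs[j])) > eps) neg.add(j);
            if (neg.isEmpty()) break;                     // stop: negative set empty
            double[] nx = new double[neg.size()], nr = new double[neg.size()];
            for (int t = 0; t < neg.size(); t++) {
                nx[t] = xs[neg.get(t)];
                nr[t] = ys[neg.get(t)] - m.applyAsDouble(nx[t]); // residual labels
            }
            DoubleUnaryOperator patch = patcher.fit(nx, nr);     // h^(i)
            model = x -> m.applyAsDouble(x) + patch.applyAsDouble(x); // overlap '+'
            current = neg;
        }
        return model;
    }
}
```

With a mean-fitting toy learner and constant labels, the base pass already separates everything and the loop terminates on the empty-negative-set stop criterion.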

In the testing phase, the hypotheses from training are used to create patching data that compensates the base hypothesis. The key point in the testing phase is determining the suitable patching hypothesis. The vector-set similarity function VS accepts two data sets, S_i from the training data and T_{i−1} from the testing data, and generates a subset T_i of T_{i−1} such that each vector x1 ∈ T_i is similar to at least one vector x2 ∈ S_i, denoted vs(x1, x2) ≥ δ, where δ ∈ [0, 1] is the degree of similarity. If

Fig. 1. A model h(x) can be considered a hyper-surface. x1 is well-separated, x2 is not well-separated, and x3 is misclassified (y1, y2, y3 > 0). The part of the hyper-surface in the rectangle area needs to be repaired.

x1 = x2, then vs(x1, x2) = 1, whereas if x1 ≠ x2, vs(x1, x2) = 0. Here P_i predicts labels on the negative data set T_i:

T_0 = T
T_i = VS(S_i, T_{i−1}) = {x1 | ∀ x1 ∈ T_{i−1}, ∃ x2 ∈ S_i, vs(x1, x2) ≥ δ},  δ ∈ [0, 1], i = 1..k

P_0 = {(x, y) | x ∈ T_0, y = h^(0)(x)}
P_i = {(x1, y) | x1 ∈ T_i, x2 ∈ S_i, vs(x1, x2) ≥ δ, y = h^(i)(x1)},  i = 1..k

In the above expressions, δ is the regulating parameter that controls the degree of similarity between two vectors. It can be seen that T_i is similar to S_i, so h^(i)(x) can be used to test T_i and generate the values of P_i. These values are overlapped to compensate the labels as P#_i = OV(P#_{i−1}, P_i). The final output P#_k is the predicted label set. The compensated output labels are given as follows:

P#_0 = P_0
P#_i = OV(P#_{i−1}, P_i),  i = 1..k

It can be seen that in the training phase the learner uses the hypotheses h(x) = h(x, i) together with the partitioning functions p^(i)(h^(i), x, y) as dividers, generating a positive group S#(k) = ∪_{i=1}^{k+1} S#_i and a negative group S_{k+1} = S − S#(k). We find a subset T_i of the testing data that is similar to S_i and use the hypotheses produced on S_i for testing T_i. The '+' operation is one case of the OV function, and hence the final testing result is treated as a summation over the overlapping function: P#_k = Σ_{i=1}^{k} P#_i.
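The testing-phase compensation can also be sketched. For illustration we use the crisp similarity from the text (vs = 1 iff x1 = x2, 0 otherwise), scalar inputs, and names of our own choosing:

```java
import java.util.*;
import java.util.function.DoubleUnaryOperator;

// Sketch of the testing phase: a test vector receives a patch h^(i) only if
// it is similar (here: crisp equality, vs = 1 iff x1 == x2) to some vector in
// the training negative subset S_i; patch outputs are overlapped by summation.
public class NddchaTest {
    // VS(S_i, T_{i-1}): keep test vectors similar to at least one training vector.
    public static List<Double> vsSubset(Set<Double> si, List<Double> tPrev) {
        List<Double> ti = new ArrayList<>();
        for (double x : tPrev) if (si.contains(x)) ti.add(x);
        return ti;
    }

    // Final prediction for x: base model plus every patch whose negative
    // subset contains x (the crisp case of vs(x1, x2) >= delta).
    public static double predict(double x, DoubleUnaryOperator baseModel,
                                 List<Set<Double>> negSubsets,
                                 List<DoubleUnaryOperator> patches) {
        double p = baseModel.applyAsDouble(x);
        for (int i = 0; i < patches.size(); i++)
            if (negSubsets.get(i).contains(x))
                p += patches.get(i).applyAsDouble(x); // overlap OV as '+'
        return p;
    }
}
```

A test vector that never appears in any negative subset keeps its base prediction; one that does receives the corresponding patch corrections.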

In data mining, the concern is the knowledge data, which here is a series of hypotheses, as shown in Fig. 2. From the user's perspective, the inputs of NDDCHA are the data set, the stop criteria, the base learning algorithm, and the negative-data learning method. The stop criteria could be an empty negative data set, a maximum number of iterations, etc.

III. SCHEMA OF MULTIAGENT NEGATIVE DATA MINING

Based on the understanding of negative data, the negative data concept, and NDDCHA, a schema is proposed for data partitioning. Data from the data warehouse is fed into the base learning agent, whose output is positive and negative data. The positive data contains the knowledge data, whereas the negative data does not. The negative data agent learns a model from the negative data set and transmits the positive data to the assembly agent to form the whole knowledge picture. Since the negative data involves useful information, the negative data agent recursively splits the negative data into sub-positive and sub-negative data. Each round of this iteration generates a pair of positive and negative data sets. The iteration does not stop until the stop criteria are reached.

The scheduling agent is a representative of the whole system, interfacing with customers. The interior of the rectangle in Fig. 2 is the multiagent implementation of NDDCHA; the exterior is the input and output of the system. The outputs fall into two categories: one is the knowledge data discovered and the data subsets partitioned, and the other is a termination signal notifying other agents beyond the NDDCHA system. There are four agents: the base learning agent ab, the negative agent an, the assembly agent aa, and the partitioning agent ap. When the scheduling agent receives a

Fig. 2. The schema of multiagent negative data mining.


mining job, it delegates the job to ab; ab then creates a hypothesis, submits it to the knowledge data base, and relays the data to agent ap1 to form the negative data. Next comes a loop of negative data learning performed by agent an, which is the kernel of negative learning: each pass creates a hypothesis and a negative data subset. The system is flexible because the agents can be replaced easily. For example, we can choose either a support vector machine or a neural network algorithm as the algorithm of a learning agent, and we can choose a fuzzy border or Euclidean distance as the partitioner. The assembly agent executes the combination operation for two different data sets. The diamond-shaped branch that decides whether to stop learning is part of the scheduling agent's work; therefore, the stop event to the customer is emitted by the scheduling agent. This event only tells the customer that data mining has finished for this specific data set; the whole system keeps running to respond to changes in the environment.

There are two data flows in the system: one is the classifying data set from the data warehouse, and the other is the knowledge data (hypotheses). The protocol for the classifying data set is defined in the following BNF-like representation:

<class> = +1 | -1
<feature> = integer (>= 1)
<value> = real
<pair> = <feature>:<value>
<pairs> = <pair>+
<line> = <class> <pairs>
<data> = <line>+

This protocol could be extended to an XML format to exchange the data easily; the drawback is that an XML representation makes the data files larger. The protocol for knowledge data is a descriptive text file, because different learning algorithms have different model representations. For example, a model in SVM consists mainly of a kernel definition and a list of support vectors, while a model in a neural network is the topology of the network and the weights of its edges. Rule-based knowledge data, such as rough sets and association rules, is also commonly used. Although model-based knowledge data is hard for humans to interpret, it contains full coverage of the information in the classifying data set.
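As an illustration, a minimal parser for this line protocol might look as follows (the class and method names are ours, not part of the paper's protocol definition):

```java
import java.util.*;

// Minimal parser for the classifying-data line protocol:
// <line> = <class> <pairs>, e.g. "+1 3:0.5 7:1.25".
public class LineProtocol {
    // Extracts the <class> token (+1 or -1) from a data line.
    public static int parseClass(String line) {
        return Integer.parseInt(line.trim().split("\\s+")[0]);
    }

    // Extracts the <feature>:<value> pairs, preserving their order.
    public static Map<Integer, Double> parsePairs(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<Integer, Double> pairs = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {     // skip the class token
            String[] fv = tokens[i].split(":");
            int feature = Integer.parseInt(fv[0]);    // <feature> is an integer >= 1
            pairs.put(feature, Double.parseDouble(fv[1]));
        }
        return pairs;
    }
}
```

For the line "+1 3:0.5 7:1.25", `parseClass` yields +1 and `parsePairs` yields the sparse feature map {3 → 0.5, 7 → 1.25}.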

The cooperation of agents uses a strategy in which control events are transported through a centralized registry table. This is a partial software implementation of the MOST network [9]. Each function or interface of an agent registers with the registry during the system initialization phase, and the system also maintains an event notification matrix for each property of an agent. The behavior is similar to an event listener in Java, which uses a container to register an event entry. Once a property of an agent changes, it sends its notification matrix to a manager, and the manager dispatches the event to all registered agents. The system contains two channels: a data channel and a control channel. The data channel is responsible for transporting the classifying data, while the control channel dispatches or broadcasts messages and events. The data channel has synchronous and asynchronous types; asynchronous data is used for high-bandwidth packets, which are necessary for large data sets. Based on the MOST architecture, the system could also work as a distributed system.
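The registry-and-dispatch mechanism can be sketched in a few lines. This is our illustration of the described behavior, not MOST code; the property names and callback shape are assumptions:

```java
import java.util.*;
import java.util.function.Consumer;

// Sketch of the centralized registry / event dispatch described above:
// agents register callbacks for named properties; when a property changes,
// the manager dispatches the event to every registered agent.
public class EventRegistry {
    private final Map<String, List<Consumer<Object>>> listeners = new HashMap<>();

    // Registration happens in the system initialization phase.
    public void register(String property, Consumer<Object> agentCallback) {
        listeners.computeIfAbsent(property, k -> new ArrayList<>()).add(agentCallback);
    }

    // Called when an agent's property changes; broadcasts on the control
    // channel and returns the number of agents notified.
    public int notifyChange(String property, Object newValue) {
        List<Consumer<Object>> subs = listeners.getOrDefault(property, List.of());
        for (Consumer<Object> cb : subs) cb.accept(newValue);
        return subs.size();
    }
}
```

An agent interested in, say, a (hypothetical) "negativeDataReady" property registers once and is then called back on every change, mirroring the Java event-listener pattern the text invokes.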

IV. CONCLUSION

NDDCHA improves learning-algorithm performance by compensating the base hypothesis with the negative data set. Useful information in the negative data is mined to benefit the model of an application. This approach expands the hypothesis space toward the target space so that the approximation error is reduced. The multiagent technique extends NDDCHA to mining large data sets and provides a means to cluster the data by partitioning the negative and positive data sets. The proposed schema offers a flexible architecture that enables users to choose learning algorithms freely without breaking the software system. The schema described here is for binary classification; it could be extended to multi-class classification and regression, because the positive/negative data concept is also valid in those scenarios.

As future work, we will study these scenarios and other efficient data protocols, investigate the relationship between the partitioning method and the data distribution, and improve the efficiency of data transportation.

REFERENCES

[1] M. P. Singh and M. N. Huhns, Multiagent Systems: A Theoretical Framework for Intentions, Know-How, and Communications. Springer-Verlag, 1994.

[2] M. Klusch, S. Lodi, and G. Moro, "The role of agents in distributed data mining: issues and benefits," 2003.

[3] M.-S. Chen, J. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 866–883, 1996.

[4] M. Wooldridge, An Introduction to MultiAgent Systems. John Wiley & Sons, 2002.

[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[6] V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed. Springer, 2000.

[7] R. E. Schapire, "The Boosting Approach to Machine Learning: An Overview," presented at the MSRI Workshop on Nonlinear Estimation and Classification, Berkeley, CA, 2001.

[8] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123–140, 1996.

[9] MOST Cooperation, "MOST Specification version 2.5," http://www.mostnet.de/downloads/Specifications/MOST%20Specifications/MOSTSpecification.pdf, 2005.


Distributed Multi-Agent Knowledge Space (DMAKS): A Knowledge Framework Based on MADWH

Adrian Gardiner Dept. of Information Systems, Georgia Southern University,

[email protected]

Abstract

Business data warehousing, an information architecture built upon the principle of data centralization and objectivity, is poorly equipped for environments in which knowledge resources are highly distributed, locally managed, and increasingly feature less-structured data types. To address the limitations of traditional data architectures and to broaden the scope of information management, information architectures called distributed knowledge management (DKM) have been proposed that focus on managing the creation of local knowledge within autonomous groups and exchanging knowledge across them [5]. Commonly, DKM approaches advocate peer-to-peer agent networks as the deployment methodology. In such networks, peers are either forced to group or can group spontaneously [3]. In this paper, we propose an alternative approach to self-organizing peer grouping called the Distributed Multi-agent Knowledge Space (DMAKS), a distributed knowledge repository modeled on the MADWH concept devised by [35]. The strengths of this approach are numerous: self-organization is based on functionality rather than purely on semantics; self-organization is addressed at both the macro (society) level and the micro (local) level through dynamical hierarchies; evolution in functional states and knowledge is possible; different levels of abstraction of both knowledge and functionality are available; and native support for functional decomposition enables the construction of dynamic aspect systems.

Key Terms - Data warehouse and repository, Knowledge management architecture, Multiagent systems, Peer-to-peer networks, Self-organizing systems.

1. Traditional Data Warehouse

Architecture

Many companies have recognized the strategic

importance of the knowledge hidden in their large

databases and have built data warehouses [13]. A data

warehouse is a centralized collection of data and

metadata from multiple sources, integrated into a

common repository and extended by summary

information (such as aggregate views) that is used

primarily to support organizational decision making

[21].

Traditional data warehouses are designed as

permanent data repositories, where data is maintained by

building stovepipe systems that periodically feed in new

data. Data within a data warehouse is normally a

duplication of existing data sets that are located either

inside or external to the organization, but integrated and

transformed with the purpose of supporting decision

making [23].

Within the data warehouse paradigm, knowledge is generated primarily in two ways: from analyst-directed interactions with the data repository, such as through visualization and OLAP tools; and from the identification of interesting facts and patterns through the application of data mining algorithms.

Typically, the data cube is used as the data model for warehouse data, given the multi-dimensionality of the stored data structures [13]. A data cube consists of several independent attributes grouped into dimensions, and some dependent attributes called measures. A data cube can be viewed as a d-dimensional array, with each cell containing the measures for the respective sub-cube [13]. A data model suitable for multidimensional analysis should provide the means to define: (1) dimension levels, (2) grouping/classification relationships that link those levels, and (3) analysis paths [11].

2005 IEEE ICDM Workshop on MADW & MADM 53


Dimensions within data cubes commonly reflect concept hierarchies [13]. A concept hierarchy is an instance of a hierarchy schema. A hierarchy schema can reflect a simple hierarchy (each link between parent and child levels has one-to-many cardinality) or more complex hierarchical structures, such as non-strict hierarchies, where connections between hierarchical levels can have many-to-many cardinality [31].

Analysts, with knowledge of the embedded concept hierarchies, typically interact with the data cube to perform multidimensional analysis (i.e., to choose the most relevant view for induction and hypothesis testing), which mainly consists of the following view operations: drill-down, roll-up, and slice and dice.
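To make the cube model concrete, the following Python sketch (illustrative only; the dimension names, data values, and helper names are invented for this example) represents a data cube as a mapping from dimension-value tuples to a measure, with roll-up and slice implemented over it:

```python
from collections import defaultdict

# Toy data cube: dimensions (region, product, year) -> measure (sales).
cube = {
    ("East", "laptop", 2004): 120, ("East", "laptop", 2005): 150,
    ("East", "phone",  2004):  80, ("West", "laptop", 2004):  90,
    ("West", "phone",  2004):  60, ("West", "phone",  2005):  70,
}

def roll_up(cube, keep):
    """Aggregate the measure over all dimensions whose positions are
    not listed in `keep`, as in an OLAP roll-up."""
    out = defaultdict(int)
    for dims, measure in cube.items():
        out[tuple(dims[i] for i in keep)] += measure
    return dict(out)

def slice_(cube, axis, value):
    """Fix one dimension to a single value, as in an OLAP slice."""
    return {d: m for d, m in cube.items() if d[axis] == value}

sales_by_region = roll_up(cube, keep=[0])      # {("East",): 350, ("West",): 220}
east_only = slice_(cube, axis=0, value="East") # three cells remain
```

A drill-down is simply the inverse move: returning from the rolled-up view to a view that keeps more dimension positions.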

Traditional data warehouses have several limitations: their infrastructure is centralized and inflexible; they generate redundant data; they rely on predefined dimensions, dimension levels, hierarchies, and aggregation paths; and, given the need to periodically renew data sets, the information stored in them is commonly dated, which undermines analysts' ability to perform real-time analysis.

2. Emerging Forms of Knowledge Management Architecture

With the advent of the semantic web, traditional information architectures are expected to evolve into ones based more upon collections of autonomous and semi-autonomous agents that come together to form flexible, dynamic-learning decision analysis cooperatives. The traditional architecture for business data warehousing, built upon the limiting concepts of inflexible system infrastructure and the centralization and duplication of data, is therefore poorly equipped to fit this new vision for information management.

Firestone [15] predicts that data warehouses will increasingly move towards distributed data management, to address a reality in organizational information architecture best described as involving "physically distributed knowledge objects across an increasingly far-flung network architecture" [15]. Formally, [15] defines Distributed Knowledge Management (DKM) as a system that manages the integration of distributed objects into a functioning whole to produce, maintain, and enhance a business knowledge base. Firestone [14] also highlights that the trend towards developing local (distributed) data marts is a sign that many organizations' knowledge management architectures are becoming decentralized and disintegrated. Firestone [14] defines a data mart as a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process, focused on a single business process or single department (e.g., marketing).

The epistemological assumptions of the centralization paradigm have also been heavily criticized by ([5], [4], [6]) and colleagues, who argue that, given the subjective nature of knowledge, centralization takes knowledge out of its social context, where stakeholders typically hold locally shared interpretation schemas that give knowledge meaning. Such views about the limitations of traditional data architectures have coalesced into an approach to information management called distributed knowledge management (DKM) [5], which focuses upon managing the creation of local knowledge within autonomous groups and exchanging knowledge across them [5].

The two core principles of DKM, as devised by [5], are: the principle of autonomy (communities of knowing should be granted the highest possible degree of semantic autonomy to manage their local knowledge); and the principle of coordination (collaboration between autonomous communities must be achieved through a process of semantic coordination, rather than through centrally defined semantics, i.e., semantic homogenization). The principle of coordination (semantic interoperability between domains) has been addressed in numerous ways - for example, through shared schemas such as ontologies (see, e.g., [2], [12]), and through local context descriptions (see, e.g., [7]).

Commonly, DKM is deployed through agent architectures - in particular, peer-to-peer (P2P) computing. A DKM system utilizing P2P computing is commonly referred to as peer-to-peer knowledge management (P2PKM). An example of such a system is [3]'s tool for DKM called KEEx.


In a "pure" P2P architecture [1], peers (nodes) have equal responsibility and capability, and may join or leave the network at any time, thus eliminating the need for a centralized server. Each peer can make information available for distribution and can establish direct connections with any other member node to download information [25]. Moreover, peers dynamically discover other peers on the network and interact with each other by sending and receiving messages [1]. Accordingly, a client seeking information from a P2P network searches across scattered collections stored at numerous member nodes, all of which appear to be a single repository with a single index [25].

The two core operations in most peer-to-peer systems are: (1) finding the right peers for querying (peer discovery and selection), and (2) the efficient routing of messages (network topology) [17]. In P2PKM, core operations may also include: (3) semantic interoperability (a.k.a. semantic coordination, mediation, and resolution - peer interpretation and translation), and (4) the organization of knowledge nodes (peer membership, clustering, structure, and aggregation).

3. What is a Peer?

[4] proposes to model an organisation as a "constellation" of knowledge nodes, which are autonomous and locally managed knowledge sources. The concept of knowledge nodes is similar to [24]'s concept of knowledge clusters. Within that framework, a knowledge cluster is an instance of an ontology, and therefore represents some structured knowledge. A knowledge cluster may relate to the overall knowledge of an agent, to a specific task, or to a given topic [24]. Basic operators on knowledge clusters include addition, filtering, search, is-subpart-of, and comparison [24]. If knowledge nodes/clusters can be viewed as object instances, they effectively act as technical gatekeepers to a knowledge source.

At a broader level of composition, peers may also contain networks. [28] addresses small-world networks, which exhibit special properties such as a small average diameter and a high degree of clustering, and claims such a network topology is effective and efficient for spreading and finding information. [28] also suggests a method of letting peers organize themselves into a small-world structure around the topics that peers contain knowledge about (organization around semantics).

As highlighted by the concept of small-world structures, it is important to appreciate that knowledge nodes in many proposed P2PKM architectures can be coarse-grained, in that they may be a composite of a number of different data types (and thus contain varying data and semantic structures) (see, e.g., [4], [3]). In addition, knowledge nodes may contain their own internal network structures (organization).

4. Organizing Peers into Groups

In P2PKM systems, peers are either forced to group or can group spontaneously [3]. Groups of peers are commonly referred to as communities, and membership of a community may be open [3] and dynamic. The community analogy is common because there is frequently an implied 'social' connection between peers (see, e.g., [3]). Sometimes community membership is determined by designers (i.e., mandatory membership), rather than through peer interactions. The amalgam of all community instances is commonly referred to as a society [35].

A community of peers need not share physical proximity, but should possess relational proximity (e.g., shared interest, intent, or practice), which gives the community identity. It is therefore likely that peers within a community will have a high degree of shared semantics and functionality. Once communities and the relationships between them are established, this meta-structure becomes the network topology.

Schmitz [27] proposes that communities of peers be formed based upon measures of semantic similarity, which can be calculated through inter-peer comparisons of profile information, queries, or knowledge items. A version of this approach is demonstrated in [29], which outlines a semantic clustering approach for the identification of small-world structures.
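A minimal sketch of this idea, under the assumption that peer profiles are term-weight vectors (the peer names, terms, and threshold here are invented): peers join a community when their cosine similarity to an existing member exceeds a threshold.

```python
import math

# Hypothetical peer profiles: term -> weight, e.g. derived from local
# knowledge items or past queries.
profiles = {
    "p1": {"olap": 3, "cube": 2},
    "p2": {"olap": 2, "cube": 1, "mining": 1},
    "p3": {"ontology": 4, "context": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def communities(profiles, threshold=0.5):
    """Greedy single-link grouping: a peer joins the first community
    containing a sufficiently similar member, else starts its own."""
    groups = []
    for pid, prof in profiles.items():
        for g in groups:
            if any(cosine(prof, profiles[q]) >= threshold for q in g):
                g.append(pid)
                break
        else:
            groups.append([pid])
    return groups
```

Here p1 and p2 share vocabulary and cluster together, while p3 (no overlapping terms) forms its own community; this is only one of the comparison bases (profiles, queries, knowledge items) that [27] mentions.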

It is of interest to note that other possible methods to define communities, such as through shared structural or functional properties, have not been emphasized in the P2PKM literature. This lack of emphasis perhaps reflects the P2P community's focus on information search and sharing, rather than on accommodating functional orchestration (the latter being a more dominant feature of (web) service architectures) or other models of resource sharing.

[24] warn that membership of a knowledge community should not replace the intrinsic goal of the agent for which it was introduced into the system. This view implies that community membership should not necessarily lead to homogenization.

5. What Does it Mean for Peer Groups to be Self-Organizing?

Gershenson and Heylighen [20] maintain that the term self-organizing has no universally accepted meaning; however, self-organization is often defined as global order emerging from local interactions [19] and from interactions with the environment [16]. It therefore follows that global order is an emergent property of the community. Indeed, [26], in describing this process, refers to the concept of emergent classification: the process of obtaining novel classifications of an environment by a self-organizing system, which can only be achieved through structural changes.

In some proposed P2PKM systems, local interactions involve the exchange of semantic information - for membership testing and community definition. In other proposed systems, however, communities may largely be pre-defined: for example, through ontological classification of local knowledge stocks, or when the design objective is to mimic established human social networks (see, e.g., [28], [29], [17]).

A frequently cited aim of self-organization is to increase order (excluding imposed order) [20]. The degree of disorder present within a system is, roughly speaking, proportional to its degree of entropy [32], as a system high in entropy will likely lack structure or differentiation between components [20].
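As an illustration of this entropy-as-disorder reading (the state counts below are invented), Shannon entropy over the distribution of component states is low when most components share one state (differentiation, structure) and maximal when states are uniform:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of a distribution of component states;
    a rough proxy for the 'disorder' discussed above."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

ordered = shannon_entropy([97, 1, 1, 1])        # most components in one state
disordered = shannon_entropy([25, 25, 25, 25])  # uniform: log2(4) = 2 bits
```

On this view, self-organization that increases order corresponds to the entropy of the system's state distribution decreasing over time.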

In addition to order, [19] stress that organization requires structure with function. Structure means that the components of a system are arranged in a particular order (e.g., connections that integrate the parts into a whole), while function means that this structure fulfils a purpose [20] (i.e., has intent).

[18] holds the view that self-organization implies that local behavior is (at least partially) caused internally, i.e., independently of the environment (endocausality); in other words, processes occurring within the boundary are not controlled or 'organized' by an external agent, but (at least partially) by the system itself [18]. This view supports the notion that a degree of autonomy is required to enable self-organization.

Autonomy is defined by [27], in the context of DKM, as users maintaining control over their knowledge while remaining willing to share knowledge with other peers; peers are therefore not subject to central coordination. [3] refers to this as 'semantic autonomy'. Semantic autonomy is a cornerstone principle of DKM [5]. This type of autonomy is perhaps best characterized as autonomy of control (vis-à-vis autonomy of function).

However, autonomy has a wider meaning than autonomy of control. [8] states that autonomy means self-governing (a.k.a. self-controlling, self-steering). Moreover, [10] includes cohesion (relating to unity) as a central notion in autonomy, and hence in an entity's identity: the interactions within a system that bind its parts are stronger than its interactions with external systems and internal fluctuations. [9] claims cohesion is important as it both unifies a dynamical object and distinguishes it from other dynamical objects (i.e., provides order).

Therefore, full autonomy implies that local decision making and intent, while possibly influenced by the receipt of external knowledge and environmental interaction, are independent of consideration of any external agent. At the other extreme, a complete lack of autonomy implies an agent completely under the control and influence of an external agent, with no local initiative.

The degree of autonomy of members within a society space will influence the type of interactions a peer has with its environment and fellow community members. For example, interactions between members of a specific community may reflect a degree of mutual dependence (i.e., lower autonomy), given shared intent; in this regard, peers can be considered semi-autonomous. Moreover, assuming that peer membership is dynamic, no community can successfully define itself (and re-evaluate its identity and intent) without significant interactions with non-member peers, as identity is relational within a dynamic open system.

The above discussion underlies four propositions in relation to the self-organization of peers. First, a society of peers cannot be self-organizing when all peer members have full autonomy, as no negotiation between peer members is then possible in terms of establishing identity; in other words, peers need to be semi-autonomous. Second, the level of autonomy between peer communities will be less than the autonomy between peer members within communities (within-community coupling is greater than coupling with the environment). Third, where peer communities have different levels of autonomy within the society, imperfections (inefficiencies) in network topologies may eventuate. Fourth, there is a general inverse relationship between peer grouping granularity and the level of peer autonomy, and full autonomy can exist only at the society meta-level.

It is interesting to note that research into self-organization theory has emphasized organization through functional classification, rather than through semantics. For example, the concept of autonomy has primarily been defined in terms of functionality rather than semantics. In contrast, P2PKM researchers have commonly proposed the latter approach as a method for establishing group boundaries.

The emphasis on functionality in self-organization research is not surprising, given that intent and identity are primarily related to what one does (function), rather than to what one knows. It follows that determining boundaries based upon semantics will emphasize semantic-based functions - which, in proposed P2PKM systems, are commonly information search and retrieval, context sharing, and semantic resolution. Simply put: semantic-based organization will tend to divide information across semantically derived boundaries (i.e., identity without intent).

A desirable property of P2PKM systems is to emulate real-world social networks ([4], [3]). Real-world social networks involve knowledge-intensive business processes, including collaboration, coordination, prediction, and reasoning, in which knowledge is not only interpreted but created (e.g., through insights or the identification of patterns). Such a view supports the assertion that knowledge networks should primarily be defined through function, rather than through semantics. Overall, we suggest that it is unlikely that classification (identity) based on functionality and classification based on knowledge will produce identical community boundaries (network topology).

In summary, methods to enable self-organization within P2PKM are still at an early stage of development. There are three weaknesses in current P2PKM approaches in this area:

- organization methods emphasize semantics over function;

- organization methods have focused upon network (society-level) organization, rather than self-organization at finer levels of granularity; and,

- there is an implicit assumption in a number of P2PKM approaches that knowledge nodes are stable, and thus the issue of node evolution has not yet been adequately addressed in the research literature.

6. DMAKS

Recently, Zhang and Zhang [35] introduced the concept of a multiagent data warehouse (MADWH) structure as a novel approach to brain modeling. A MADWH is a dynamically evolving structure that classifies and organizes semiautonomous agents into communities and societies for coordinated learning and decision analysis [35]. Within this field of research, brain functions are viewed as analogous to communities of fine-grained interacting multiagent systems (MACs). An important challenge emphasized by this research is to identify algorithms that emulate the identification and evolution of multi-dimensional structures (e.g., cuboid-based star schemas), thus providing possible analogs for brain functionality decomposition and granularity. Success in emulating the assembly and management of knowledge and functional structures may provide insights into natural brain organization, and provide the basis of new approaches for developing intelligent machines.


Our intention in this paper, however, is not to emulate brain functionality, but to discuss the application of MADWH concepts to the field of DKM. In doing so, we introduce the concept of a Distributed Multi-agent Knowledge Space (DMAKS), a distributed logical knowledge repository modeled on the MADWH concept. A DMAKS instance, while distributed, has clear logical boundaries and a meta- and micro-structure similar to that of a MADWH.

Within a MADWH, the distinction between autonomous and semiautonomous agents is important, as the functionality of an autonomous agent (e.g., a human being) is materialized through the coordination of encapsulated single-function semi-autonomous agents. In this context, a semi-autonomous agent is one that does not possess full autonomy (independent control) over its behavior (functionality/actions), and therefore does not take actions without coordination with other agents. In this way, semi-autonomous agents differ from the more coarse-grained agents typically found in many MACs, whose actions usually exhibit a greater level of autonomy [34]. The proposal for semi-autonomous agents is predicated upon the observation that brain functions are not performed in isolation [35]. For example, the left and right ears of a human being cannot be considered truly functionally independent.

Zhang and Zhang [35] propose that associated semi-autonomous agents can be identified (mined) by examining their degree of functional similarity: similar (but not identical) agents will share one or more corner parameters, which are dimensions of agents' actions that are common across agent instances. The more distinct these actions are, the greater the corner distance between the two (corner) agents.

Under this approach, agent similarity can be measured as distance across a multi-dimensional function space, with a threshold distance logically distinguishing distinct agent communities. [35] and [34] discuss several functional distance measures that can be used to identify corner agents. New cuboid corner agents (or new cuboids) are established in a MADWH through extrapolation, and existing cuboids are fine-tuned through interpolation [35], as new knowledge and functionality found by existing agents may redefine existing functional taxonomies.
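The distance-threshold idea can be sketched as follows. This is not the authors' algorithm, only an illustration under simplifying assumptions: agents are points in a functional space (one axis per action dimension), Euclidean distance stands in for one of the functional distance measures, and the agent names and coordinates are invented.

```python
import math

# Hypothetical agents described by coordinates in a functional space.
agents = {
    "a1": (0.0, 0.1), "a2": (0.2, 0.0),  # functionally near each other
    "b1": (5.0, 5.1),                    # functionally distant
}

def corner_distance(p, q):
    """Euclidean distance, standing in for a functional distance measure."""
    return math.dist(p, q)

def partition(agents, threshold=1.0):
    """Group agents whose pairwise corner distance stays within the
    threshold; each group is a candidate set of corner agents for a
    base cuboid."""
    groups = []
    for name, pos in agents.items():
        for g in groups:
            if all(corner_distance(pos, agents[m]) <= threshold for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups
```

Here a1 and a2 fall within the threshold and form one candidate cuboid, while b1 lies beyond it and seeds another.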

Once corner agents are identified (agents within the threshold distance), they are organized into base cuboids (corner agent sets), thus representing local (micro) functional structure. In this way, the MADWH emphasizes decomposition based upon functionality rather than, for example, semantics. The implicit assumption in MADWH is that knowledge is tied to (enables) functionality.

Within a DMAKS, a knowledge stock is not considered part of the logical repository unless it is associated with an active function. In this respect, knowledge associated with active functions becomes active knowledge. Agents supporting functions outside the scope of the knowledge repository (inactive functions) are excluded (i.e., become inactive knowledge). However, the functionality supported by a DMAKS repository is constantly under revision, and agents join or leave the DMAKS repository according to functional demand (activation). Therefore, membership of the DMAKS society is also determined primarily on a functional basis, rather than through semantics (cf. P2PKM). Primary generic functions supported within a DMAKS instance include prediction, analytical, and reasoning functions, while secondary generic functions include pattern identification functions and alert (arousal) systems. In this way, the functionality of DMAKS is clearly aimed at decision support, rather than at direct support of operational processes.

Agents within a MADWH (agent community) belong to dynamic agent cuboids, which organize similar (cooperative or competitive) semiautonomous corner agents into hypercube structures. The value of organizing agents in this way is that the cuboid meta-structure can easily accommodate natural growth and evolution across a number of functional dimensions [35] (i.e., changes in the definition of functions, or in which functions are active, will be reflected within the relevant agent cuboids). MADWH also provides a meta-structure for cuboid organization at the agent community and society levels. Local agent cuboids within the MADWH agent society are organized and interconnected through a Hasse-type lattice cuboid meta-structure [35]. This structure can also be applied at the level of society abstraction to interlink all functional communities within the society. These organizational structures are also included within a DMAKS instance. The lattice cuboid meta-structure provides the conditions for dynamic hierarchies, which allow for emergent definitions of active functions. [18] defines a dynamical hierarchy as a dynamical system with multiple levels of nested subcomponent structures, in which the structures and their properties at one level emerge from the ongoing interactions between the components of the lower level.

Adapting the lattice meta-structure to DMAKS allows different operations commonly associated with reasoning and interrogation to be performed at different levels of abstraction (e.g., level of problem representation). In a drill-down operation, the level of functional abstraction is reduced until a function (and therefore knowledge set) best fitting the active problem is found. An example of roll-up is to summarize functional knowledge and reasoning through the lattice. Roll-up operations can also be performed by either relaxing or tightening the distance threshold that acts to define cuboid dimensions. Typical slice/dice operations can be performed by focusing on a sub-part of the lattice structure.
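The threshold-based roll-up can be illustrated with a small self-contained sketch (again illustrative, not the authors' algorithm; agent names and positions are invented): the same single-link grouping run at a loose threshold merges the cuboids that a tight threshold keeps separate.

```python
# Agents positioned along one invented functional dimension.
agents = {"a": (0.0,), "b": (0.4,), "c": (2.0,), "d": (2.3,)}

def group(threshold):
    """Single-link grouping by functional distance; relaxing the
    threshold merges cuboids (a roll-up view), tightening it splits
    them (a drill-down view)."""
    groups = []
    for name, pos in agents.items():
        for g in groups:
            if any(abs(pos[0] - agents[m][0]) <= threshold for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

fine = group(0.5)    # [["a", "b"], ["c", "d"]] - detailed view
coarse = group(2.5)  # [["a", "b", "c", "d"]]   - rolled-up view
```

Slice/dice then corresponds to restricting attention to one of these groups (a sub-part of the lattice) rather than changing the threshold.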

Given that decomposition within DMAKS is based on functionality rather than semantics, this approach allows access to knowledge across domains or knowledge nodes. (In other words, multiplicity in knowledge application is directly supported through functional decomposition.) Such access to functionality and knowledge is important in the development of dynamic aspect systems, which are difficult to implement in current forms of P2PKM. An aspect system can be defined as a functional part of a system, limited to some of its properties or aspects ([30], cited in [20]). Aspect approaches present an alternative to system analysis based on subsystem decomposition or object classification. In decision support, an aspect approach is quite often required when looking at a problem along different dimensions, and support for this way of viewing knowledge and functionality is therefore attractive. From a systems perspective, examples of aspects are technical, legal, organizational, social, cultural, and economic aspects [33]. Within computer science, there is also growing interest in aspect-oriented computing [22], which is based on the concept of functional cross-cutting. Examples of cross-cuts (aspects) in the context of operating systems are: algorithm, data structure, and reference locality [22].

7. Conclusion

By adapting MADWH concepts to the architecture of a knowledge space in the form of a DMAKS, we have outlined a possible high-level theoretical architecture for self-organizing peers within P2PKM. Advantages of this approach include:

- self-organization is based on functionality, rather than semantics;

- self-organization is addressed at both a macro (society) and micro (local) level through dynamical hierarchies;

- evolution in functional states and taxonomies is possible;

- different levels of abstraction of both knowledge and functionality are available;

- the native support of functional decomposition enables the construction of dynamic aspect systems;

- emergent classification and recognition of functionality is supported through the processes of interpolation and extrapolation; and,

- the evolution of active functions and knowledge is managed effectively through a coordinated multidimensional structure.

8. References

[1] S. S. R. Abidi and X. Pang, "Knowledge Sharing Over P2P Knowledge Networks: A Peer Ontology and Semantic Overlay Driven Approach", International Conference on Knowledge Management, Singapore, (13-15 December 2004).

[2] M. Arumugam, A. Sheth, and I. B. Arpinar, "Towards Peer-to-Peer Semantic Web: A Distributed Environment for Sharing Semantic Knowledge on the Web", Technical Report, Large Scale Distributed Information Systems Lab, University of Georgia, (2001).

[3] M. Bonifacio, P. Bouquet, P. Busetta, A. Danieli, A. Donà, G. Mameli, and M. Nori, "KEEx: A Peer-to-Peer Tool for Distributed Knowledge Management", Working Paper, (2005).

[4] M. Bonifacio, P. Bouquet, and R. Cuel, "Knowledge Nodes: the Building Blocks of a Distributed Approach to Knowledge Management" Journal of Universal Computer Science, vol. 8, no.6, (2002), pp.652-661.

[5] M. Bonifacio, P. Bouquet, and P. Traverso. “Enabling Distributed Knowledge Management. Managerial and Technological Implications.“ Informatik: Zeitschrift der schweizerischen Informatikorganisationen, vol. III, (2002), pp.1-7.

[6] M. Bonifacio, P. Bouquet, and G. Mameli and M. Nori. “Peer-mediated Distributed Knowledge Management” Technical Report # DIT-03-032, (2003).

[7] P. Bouquet, A. Dona, L. Serafini, and S. Zanobini, “ConTeXtualized local ontology specification via CTXML” AAAI 2002 Workshop on Meaning Negotiation, 28th July (2002).

[8] J. Collier. “What is Autonomy?”, International Journal of Computing Anticipatory Systems: CASY 2001 - Fifth International Conference, (2002).

[9] J. Collier. “Interactively Open Autonomy Unifies Two Approaches to Function”, In: Computing Anticipatory Systems: CASY’03 - Sixth International Conference, edited by D. M. Dubois, American Institute of Physics, Melville, New York, AIP Conference Proceedings 718 (2004), pp. 228-235.


2005 IEEE ICDM Workshop on MADW & MADM 60



Applying MultiAgent Technology for Distributed Geospatial Information Services

Naijun Zhou (1) and Lixin Li (2)
(1) Department of Geography, University of Maryland - College Park
(2) Department of Computer Sciences, Georgia Southern University
[email protected], [email protected]

Abstract

This paper describes a framework that uses multiagent technology to facilitate the sharing and querying of distributed geospatial data. In particular, under the architecture of distributed Geospatial Information Services (GIServices), this paper proposes a query agent to process user queries, a metadata agent to represent the metadata of geospatial data sets, a discovery agent to locate candidate data sets related to the user query, and schema and semantics agents to identify the same or similar schema attributes and domain values in candidate data sets.

1. Introduction

The World Wide Web provides a platform for sharing and serving distributed geospatial data. Indeed, a new architecture, Geospatial Information Services (GIServices), has been proposed to support distributed geospatial data storage, query and delivery over the Web [1]. In parallel, the technologies of multiagent data warehousing and multiagent data mining enable agent-based discovery and integration of data from distributed sources, and these technologies can be applied to the design and implementation of GIServices.

Using multiagents to query and retrieve geospatial data has been proposed by, e.g., [2] and [3]. However, multiagent technology has not been examined specifically for Web-based distributed GIServices. This paper introduces a framework that uses multiagents to facilitate distributed GIServices. The proposed multiagents allow communication among user queries, metadata and geospatial databases, and provide a framework for resolving some critical issues of GIServices, including data discovery, schema matching and semantic integration. Specifically, this paper discusses the following agents:

• query agent: to query data and return query results to the user;

• discovery agent: to identify candidate data sets that are suitable for conducting a full query;

• metadata agent: to present metadata;

• schema agent: to present schema definitions, and to identify the same/similar attributes between a query agent and each candidate data set;

• semantics agent: to present the semantic definitions of database domain values, and to identify the same/similar values between a query agent and each candidate data set.

Following the typology of agents by Nwana [4], the agents for metadata, schema and semantics are static agents residing with each data set, while the query and discovery agents are mobile agents that can migrate among different data sources. All agents are deliberative agents, i.e., they have internal information and can negotiate with other agents.

2. A Framework of MultiAgent-based GIServices

2.1 System Architecture

The framework of using multiagent technology in GIServices is depicted in Figure 1. Distributed geospatial data sets are maintained locally by their providers, making data update and maintenance more efficient than storing the data on a centralized site. Together with every geospatial data set, metadata (e.g., date of data production, spatial extent and theme keywords), schema definitions (i.e., the meanings of the database attributes), and semantics definitions (i.e., the meanings of domain values) are also made available through the agents. A query agent accepts and processes user queries, migrates to each local data set, and communicates and negotiates with the metadata, schema and semantics agents. A discovery agent collaborates with the query agent to locate candidate data sets that satisfy the criteria of a given user query.
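The division of labor between static agents (residing with a data set) and mobile agents (migrating among data sources), with all agents holding internal information and negotiating, might be sketched as follows. This is an illustrative Python sketch; the class and method names are our own invention, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class StaticAgent:
    """Resides with one data set (a metadata, schema, or semantics agent)."""
    data_set: str
    knowledge: dict = field(default_factory=dict)  # the agent's internal information

    def negotiate(self, request: dict) -> dict:
        # A deliberative agent compares the request against its own knowledge
        # and answers only the items it was asked about.
        return {k: v for k, v in self.knowledge.items() if k in request}

@dataclass
class MobileAgent:
    """Migrates among data sources (a query or discovery agent)."""
    payload: dict  # the information the agent carries with it

    def visit(self, hosts: list) -> list:
        # Carry the payload to each site and collect the negotiated responses.
        return [h.negotiate(self.payload) for h in hosts]

# Example: a mobile agent visiting the static metadata agents of two data sets.
m1 = StaticAgent("data set 1", {"theme": "land use"})
m2 = StaticAgent("data set 2", {"theme": "soils"})
responses = MobileAgent({"theme": "land use"}).visit([m1, m2])
```

Each response can then be compared against the payload to decide which data sets remain candidates.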

[Figure 1 (diagram): a mobile query agent, assisted by discovery agents, communicates with the metadata, schema and semantic agents that reside with each of the distributed data sets 1 through n.]

Figure 1. Applying multiagent technology to facilitate GIServices.

2.2 MultiAgents for GIServices

This section examines the agents proposed to support GIServices. These agents work together to accomplish the tasks of finding, querying and delivering distributed geospatial data.

2.2.1. Query agent. A query agent allows users to pose queries by providing a set of terminology (an ontology) for the query predicate. The user query is parsed and interpreted by the query agent and represented in a format that other agents can understand. As a mobile agent, the query agent is the principal agent that communicates with the other agents in order to return query results to the user.

2.2.2. Discovery agent. Upon accepting a user query, the query agent passes part of its information to a discovery agent. The discovery agent is a mobile agent able to communicate with metadata agents and to identify candidate data sets.

2.2.3. Metadata agent. Every data set has a metadata agent. A metadata agent contains not only the metadata of a geospatial data set but also the algorithm for communicating with the discovery agent. The communication is a negotiation process between the discovery agent and the metadata agent: the agents exchange and compare information such as the spatial extent, the theme, and the date of data production. The result of the negotiation is a list of candidate data sets, returned to the query agent for a full query.

2.2.4. Schema agent. A schema agent, residing with each data set, contains the definitions of the geospatial database attributes. After the discovery agent identifies the candidate data sets, the query agent compares each of its attribute names with the attribute names defined in a schema agent and finds the same or a similar attribute name in the schema agent. Most likely, however, the attribute names used by the query agent and by the schema agent differ. Possible solutions to this difference include a pre-defined lookup table, a single shared ontology, or a thesaurus.

2.2.5. Semantics agent. A semantics agent aims to find the same or similar domain values (semantics) between the query agent and a data set. A semantics agent sits with a data set and maintains semantic information, including the ontology, definitions, and thesaurus of local semantics. Computational algorithms are also provided with each semantics agent, allowing it to compare local semantics with the semantics carried by a query agent.

3. An Example of MultiAgent-based GIServices

This section explains with the aid of an example how the agents are represented and how they communicate with each other. This section also briefly discusses the fundamental reasoning methods while leaving more advanced algorithms for future work. Figure 2 illustrates the flow of multiagent communications in order to answer a user query: “find the land use of cropland in Prince George’s County”.

2005 IEEE ICDM Workshop on MADW & MADM 63

Page 72: Multiagent Data Warehousing (MADW) and Multiagent Data ...cs.stmarys.ca/~pawan/icdm05/proceedings/MADW-Proceedings.pdf · intelligence, computational intelligence, machine learning,

[Figure 2 (diagram): the message/action flow. User -> query agent: find land use cropland in PG County (action: XML representation of the request). Query agent -> discovery agent -> metadata agent: find data of spatial-extent="PG County", theme="Land Use" (action: compare local metadata to the message; find the PG County data set). Query agent -> schema agent: find attribute-name="Land Use Type" (action: compare local schema to the message; find attribute lu). Query agent -> semantic agent: find domain-value="cropland" (action: compare local domain values to the message; find cropland/pasture). Query agent -> geospatial database (PG County): query lu="cropland/pasture"; the query result is returned in XML.]

Figure 2. A work-flow of agent communications for GIServices.

The user query is represented in XML in a query agent, including the theme, spatial extent, attribute names, domain values, and the ontology the user applied to pose the query. Figure 3 shows a query agent for the land use example. The query agent sends two tasks to the discovery agent: searching for data sets within the spatial extent of Prince George’s County, and searching for data sets with a theme of land use (Figure 4).

<query>
  <theme>Land Use</theme>
  <attribute-name>Land Use Type</attribute-name>
  <domain-value>cropland</domain-value>
  <spatial-extent>Prince George’s County</spatial-extent>
  <ontology>my-own-ontology</ontology>
</query>

Figure 3. The XML representation of a query agent.
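A query agent's parsing step can be sketched with Python's standard XML library. This is an illustrative sketch, not the paper's implementation; it assumes the query document uses full closing tags rather than the abbreviated `</>` form.

```python
import xml.etree.ElementTree as ET

# The query of Figure 3, with full closing tags (assumption).
QUERY_XML = """
<query>
  <theme>Land Use</theme>
  <attribute-name>Land Use Type</attribute-name>
  <domain-value>cropland</domain-value>
  <spatial-extent>Prince George's County</spatial-extent>
  <ontology>my-own-ontology</ontology>
</query>
"""

def parse_query(xml_text: str) -> dict:
    """Turn the query agent's XML into a dict that other agents can inspect."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

q = parse_query(QUERY_XML)
# q["theme"] is "Land Use"; q["domain-value"] is "cropland"
```

The resulting dictionary is the internal form the query agent carries as it migrates among the other agents.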

<message>
  <theme>Land Use</theme>
  <spatial-extent>Prince George’s County</spatial-extent>
  <ontology>my-own-ontology</ontology>
  <collaborative-agent>
    <agent-name>discovery agent</agent-name>
    <task>spatial-extent</task>
    <task>theme</task>
  </collaborative-agent>
</message>

Figure 4. A message sent by query agent to discovery agent.

Once the discovery agent receives the spatial-extent task, it performs the following actions:

• obtaining the spatial extent in coordinates (X, Y) of the location Prince George’s County using a geospatial gazetteer;

• communicating with all local metadata agents distributed in the network (a metadata agent is shown in Figure 5), each of which holds the spatial extent and theme of its local data set;

• comparing (i.e., performing a spatial query against) every metadata agent’s bounding_coordinates to find candidate data sets.
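The spatial-query step above reduces to a bounding-box intersection test between the gazetteer's extent for the place name and each metadata agent's bounding_coordinates. A minimal sketch (the gazetteer extent for Prince George's County is an assumed value; the metadata box uses the coordinates of Figure 5):

```python
def boxes_intersect(a, b):
    """Axis-aligned intersection test; boxes are (west, south, east, north) in degrees."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Extent of Prince George's County as returned by a gazetteer (assumed values).
query_extent = (-77.0, 38.5, -76.7, 39.0)

# bounding_coordinates held by one metadata agent (values from Figure 5).
metadata_extent = (-77.1, 38.5, -76.7, 39.1)

candidate = boxes_intersect(query_extent, metadata_extent)  # True: a candidate data set
```

Any metadata agent whose box intersects the query extent keeps its data set on the candidate list.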

Similarly, the discovery agent executes the task of searching for data with a theme of land use by comparing the term “land use” to each local metadata agent’s theme_keyword; for example, the <theme_keyword>land use</theme_keyword> entry in the metadata agent shown in Figure 5. Via the discovery agent, the metadata agents return to the query agent a list of candidate data sets satisfying both the spatial extent and theme criteria.

<FGDCMetadata>
  <bounding_coordinates>
    <West_bounding_coordinate>-77.1</West_bounding_coordinate>
    <East_bounding_coordinate>-76.7</East_bounding_coordinate>
    <North_bounding_coordinate>39.1</North_bounding_coordinate>
    <South_bounding_coordinate>38.5</South_bounding_coordinate>
  </bounding_coordinates>
  <theme>
    <theme_keyword>land use</theme_keyword>
  </theme>
</FGDCMetadata>

Figure 5. A metadata agent.

Once one or more candidate data sets are identified by the discovery agent, the query agent sends another task, attribute-name, to the schema agent of each candidate data set in order to find the attribute land use type in that data set (Figure 6). The schema agent compares its schema definitions to the definitions in the query agent and finds the same or a similar attribute for land use type, e.g., lu. The schema agent returns to the query agent a list of the same/similar attributes (e.g., lu) in the candidate data sets.

<message>
  <attribute-name>Land Use Type</attribute-name>
  <ontology>my-own-ontology</ontology>
  <collaborative-agent>
    <agent-name>schema agent</agent-name>
    <task>attribute-name</task>
  </collaborative-agent>
</message>

Figure 6. A message for finding a local attribute name that is the same as or similar to Land Use Type.
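The pre-defined lookup-table solution mentioned in Section 2.2.4 can be sketched as follows. The synonym table and function are hypothetical, for illustration only:

```python
from typing import Optional

# Pre-defined lookup table mapping a query-side attribute name to
# local schema attributes known to be equivalent (entries are illustrative).
LOOKUP = {
    "Land Use Type": {"lu", "landuse", "lu_code"},
}

def match_attribute(query_attr: str, local_schema: list) -> Optional[str]:
    """Return the first local attribute that is the same as or similar to query_attr."""
    synonyms = LOOKUP.get(query_attr, set()) | {query_attr}
    for attr in local_schema:
        # Exact synonym from the lookup table, or a case-insensitive exact match.
        if attr in synonyms or attr.lower() == query_attr.lower():
            return attr
    return None

match_attribute("Land Use Type", ["objectid", "lu", "acres"])  # "lu"
```

A single shared ontology or a thesaurus would replace the static table with richer similarity reasoning, but the negotiation shape stays the same.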

To find the local domain value(s) that are the same as or similar to cropland, the query agent communicates with the semantics agent and compares their domain values (semantics). A request sent by the query agent to the semantics agent is shown in Figure 7. The algorithm for comparing and finding the same or similar local domain values (land use codes) for cropland can be based on a thesaurus such as WordNet, or on the algorithms introduced by Wiegand and Zhou [5] and Zhou [6]. The semantic relations between domain values can be sameAs, differentFrom, include, etc., and are expressed in a machine-readable format such as the Web Ontology Language (OWL) [7]. Figure 8 is an ontology of the domain values of the attribute lu (i.e., land use codes) in the query agent, and Figure 9 shows a local land use code in a semantic agent having a sameAs semantic relation with the query agent’s cropland.

<message>
  <domain-value>cropland</domain-value>
  <spatial-extent>Prince George’s County</spatial-extent>
  <ontology>my-own-ontology</ontology>
  <collaborative-agent>
    <agent-name>semantics agent</agent-name>
    <task>domain-value</task>
  </collaborative-agent>
</message>

Figure 7. A request for finding the local domain value(s) that are the same as or similar to cropland.

<owl:Class rdf:ID="LandUseUserOntology">
  <owl:oneOf rdf:parseType="Collection">
    <Ontology rdf:about="#Agricultural"/>
  </owl:oneOf>
</owl:Class>
<owl:Class rdf:ID="Agricultural">
  <owl:oneOf rdf:parseType="Collection">
    <Agriculture rdf:ID="#cropland"/>
    <Agriculture rdf:ID="#farm resident"/>
    ...
  </owl:oneOf>
</owl:Class>

Figure 8. An OWL representation of the land use codes (domain values) of the land use attribute name in the query agent.

<owl:Class rdf:ID="PrinceGeorgesCodes">
  <owl:oneOf rdf:parseType="Collection">
    <PGCodes rdf:ID="crop/pasture">
      <owl:sameAs rdf:resource="#cropland"/>
    </PGCodes>
  </owl:oneOf>
</owl:Class>

Figure 9. An OWL representation of the land use codes of the lu attribute name in the Prince George’s County database having a sameAs semantic relation with cropland.
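Once the sameAs relations of Figures 8 and 9 are extracted from the OWL documents, translating a query-side domain value into local land use codes is a simple lookup. A sketch (extracting the relations from OWL itself is omitted; the relation sets below are illustrative):

```python
# Semantic relations extracted from a semantics agent's ontology
# (illustrative pairs, following Figures 8 and 9).
SAME_AS = {("cropland", "crop/pasture")}
# include, differentFrom, etc. would be held analogously, e.g.:
INCLUDE = {("Agricultural", "cropland"), ("Agricultural", "farm resident")}

def translate_value(query_value: str) -> list:
    """Local domain values standing in a sameAs relation with the query value."""
    return [local for (q, local) in SAME_AS if q == query_value]

translate_value("cropland")  # ["crop/pasture"]
```

Thesaurus- or WordNet-based similarity, as in [5] and [6], would populate these relation sets automatically instead of by hand.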

After the communications among the agents, the query agent can use query rewriting to convert the user query into local queries for all candidate data sets, and return the query results in XML to the user. A sample local query is shown in Figure 10.

<local-query>
  <theme>Land Use</theme>
  <schema-name>lu</schema-name>
  <domain-value>crop/pasture</domain-value>
  <spatial-extent>x1,y1,x2,y2,….</spatial-extent>
  <ontology>local-ontology</ontology>
</local-query>

Figure 10. A local query.
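The rewrite step combines the results of the earlier negotiations (the matched attribute from the schema agent, the translated domain value from the semantics agent, and the gazetteer coordinates) into the local query of Figure 10. A sketch with hypothetical inputs:

```python
def rewrite_query(user_query: dict, matched_attr: str,
                  local_value: str, extent_coords: str) -> dict:
    """Rewrite a user query into a local query for one candidate data set."""
    return {
        "theme": user_query["theme"],
        "schema-name": matched_attr,      # from the schema agent, e.g. "lu"
        "domain-value": local_value,      # from the semantics agent
        "spatial-extent": extent_coords,  # coordinates from the gazetteer
        "ontology": "local-ontology",
    }

local_q = rewrite_query(
    {"theme": "Land Use", "domain-value": "cropland"},
    matched_attr="lu", local_value="crop/pasture",
    extent_coords="x1,y1,x2,y2",
)
```

One such local query is issued per candidate data set, and the results are merged and returned to the user in XML.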

4. Conclusions and Discussion

This paper proposes a framework for using multiagents to represent, discover and query geospatial data. In particular, it proposes a query agent to process user queries, a metadata agent to represent the metadata of geospatial data sets, a discovery agent to identify candidate data sets for a full query, and schema and semantics agents to find the same or similar attribute names and domain values in candidate data sets.

Both multiagent technology and GIServices are emerging research areas that require further investigation. Future work includes the design and representation of the agents’ capabilities and communication, methodological research on identifying the same/similar metadata, schema attributes and semantics (i.e., GIS interoperability), and an improved GIServices architecture to support the multiagent technology.

References

[1] M. Tsou and B.P. Buttenfield, “A Dynamic Architecture for Distributing Geographic Information Services”, Transactions in GIS 6(4): 355-381, 2002.

[2] Y. Luo, X. Wang, and Z. Xu, “Agent-based Collaborative and Paralleled Distributed GIS”, XXth ISPRS Congress, July 12-23, 2004, Istanbul, Turkey.

[3] J.J. Nolan, R. Simon, and A.K. Sood, “An Agent-based Approach to Imagery and Geospatial Computing”, AGENT’01, May 28-June 1, 2001, Montreal, Quebec, Canada.

[4] H.S. Nwana, “Software Agents: An Overview”, Knowledge Engineering Review 11(3): 205-244, 1996.

[5] N. Wiegand and N. Zhou, “An Ontology-based Geospatial Web Query System”, in Peggy Agouris et al. (eds.), Next Generation Geospatial Information, Taylor and Francis, 2005, pp. 157-167.

[6] N. Zhou, “A Study on Automatic Ontology Mapping of Categorical Information”, Proceedings of the National Conference on Digital Government Research, Boston, May 18-21, 2003, pp. 401-404.

[7] World Wide Web Consortium, “OWL Web Ontology Language”, http://www.w3.org/TR/owl-features, 2004.


Published by Department of Mathematics and Computing Science

Technical Report Number: 2005-04, November 2005

ISBN 0-9738918-0-7