CSM: A Cloud Service Marketplace for Complex Service Acquisition

YEXI JIANG, Nanjing University of Posts and Telecommunications & Florida International University
CHANG-SHING PERNG, ANCA SAILER, and IGNACIO SILVA-LEPE, IBM T. J. Watson Research Center
YANG ZHOU, Georgia Institute of Technology
TAO LI, Nanjing University of Posts and Telecommunications & Florida International University

The cloud service marketplace (CSM) is an exploratory project aiming to provide “an AppStore for Services.” It is an intelligent online marketplace that facilitates service discovery and acquisition for enterprise customers. Traditional service discovery and acquisition are time-consuming. In the era of OneClick Checkout and pay-as-you-go service plans, users expect services to be purchased online efficiently and conveniently. However, as services are complex and different from software apps, the currently prevailing App Store based on keyword search is inadequate for services.

In CSM, exploring and configuring services is an iterative process. Customers provide their requirements in natural language and interact with the system through questioning and answering. Learning from the input, the system can incrementally clarify users’ intention, narrow down the candidate services, and profile the configuration information for the candidates at the same time. CSM’s back end is built around the Services Knowledge Graph (SKG) and leverages data mining technologies to enable the semantic understanding of customers’ requirements. To quantitatively assess the value of CSM, empirical evaluation on real and synthetic datasets and case studies are given to demonstrate the efficacy and effectiveness of the proposed system.

Categories and Subject Descriptors: H.4.m [Information Systems]: Miscellaneous

General Terms: Design, Algorithms

Additional Key Words and Phrases: Cloud service, interactive search, semantic web

ACM Reference Format:
Yexi Jiang, Chang-Shing Perng, Anca Sailer, Ignacio Silva-Lepe, Yang Zhou, and Tao Li. 2016. CSM: A cloud service marketplace for complex service acquisition. ACM Trans. Intell. Syst. Technol. 8, 1, Article 8 (July 2016), 25 pages.
DOI: http://dx.doi.org/10.1145/2894759

1. INTRODUCTION

1.1. Background

The advance in virtualization technology has triggered the emergence of cloud service. Over the past several years, an increasing number of service providers migrated

The work is partially supported by the National Science Foundation under grants CNS-1126619, IIS-121302, and CNS-1461926.
Authors’ addresses: Y. Jiang and T. Li, School of Computing and Information Sciences, Florida International University, 11200 SW 8th Street, Miami, FL, 33199 USA & School of Computer Science, Nanjing University of Posts and Telecommunications; C.-S. Perng (current address), Google Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043 USA; A. Sailer and I. Silva-Lepe, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA; Y. Zhou, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332 USA.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]
© 2016 ACM 2157-6904/2016/07-ART8 $15.00
DOI: http://dx.doi.org/10.1145/2894759

ACM Transactions on Intelligent Systems and Technology, Vol. 8, No. 1, Article 8, Publication date: July 2016.


their traditional IT services to the cloud. Due to their flexibility, convenience, and low cost, cloud services have gradually become a high priority for enterprise service users. However, as more services become available online, the question of how to track down a service satisfying a particular set of requirements arises as a new challenge for those customers. Traditionally, obtaining the desired service involves the following steps: (1) survey all potential service providers and identify the candidate services; (2) visit the providers’ websites and gather information about the services and providers’ contact venues; (3) contact the services’ agents for detailed information about their services and share with them both the functional and nonfunctional requirements of the researched service; and (4) make a decision on the candidate services, proceed with the purchase and on-boarding, and conduct service configuration. Due to the large number of providers and services, traditional service acquisition is tedious and time-consuming. Moreover, since each service provider has its own way of describing services, there is no systematic way for customers to compare similar services in terms of their features. Finally, new service providers are always joining the market, and it is thus difficult for service users to obtain a complete list of providers of their desired services.

1.2. Limitation of Existing Marketplaces

A possible and quick solution to improve the effectiveness and efficiency of service acquisition is to advertise the services in existing online marketplaces, such as Amazon and eBay. Traditional marketplaces provide a keyword-based search that allows the service users to find products by providing a set of keywords. This method is effective for commodity products, as their characteristics can be adequately described by several keywords. However, since these online marketplaces are specifically designed for selling commodity products, they are not suitable for service customers due to the unique characteristics of the services. There are two major differences between services and commodity products: (1) Services are more complex than commodity products. It is difficult to precisely describe a service with a small number of keywords, which makes the searching of services challenging. (2) Services are largely semifinished products, while commodity products are mostly end products. Strictly speaking, no unconfigured service can directly satisfy all of a customer’s needs. Therefore, every time a customer acquires a service, a customization procedure involving multiple rounds of interactions between the customer and the service provider is unavoidable. As existing online marketplaces offer either one-round search or multiple-round keyword-based faceted search, the aforementioned two characteristics make them inadequate for service acquisition [Akolkar et al. 2012].

A few cloud service providers, such as Amazon and HP, partially addressed the problem by providing dedicated service marketplaces like Amazon Cloud Marketplace and HP Cloud. However, these vendor-dedicated marketplaces only advertise their own services but do not allow the customers to systematically compare all available services across different service providers. Therefore, whenever the customers need services, they still need to visit the marketplaces of individual service providers.

There are also some efforts to help customers choose services across service providers [Li et al. 2010b]. However, that work is restricted to the Infrastructure-as-a-Service category and recommends services from the application performance perspective. In contrast, the Cloud Service Marketplace (CSM) is designed to help customers find proper services according to their business requirements and involves a broader range of services, including IaaS, PaaS, and SaaS.

1.3. A Better IT Service Ecosystem

In the ecosystem of traditional IT services, there are two kinds of participants: enterprise service users and service providers. The information exchanged during a service



matching procedure defines a graph with a many-to-many relationship between the participants. This kind of service ecosystem requires a strenuous effort for the service consumer to find a service matching the requirements. Indeed, once the number of candidate services is large, the customers are not likely to efficiently find a proper service given that they need to consider the service’s capabilities, functionalities, and prices. To address the limitations of the two-participant ecosystem, we propose CSM as a third participant in the existing ecosystem. CSM acts as an intermediary between service consumers and service providers. From the user’s perspective, CSM offers the following key advantages:

(1) CSM simplifies the information exchange between service consumers and providers. CSM becomes the hub of the other two kinds of participants. It simplifies the many-to-many relationships to two many-to-one relationships. First, the providers only need to contact CSM to on-board their services. Second, the service users only need to contact CSM to obtain a specific service instead of having an overwhelming number of interactions with a large number of service providers.

(2) CSM provides consumers with important information about services during their acquisition. The service acquisition is conducted based on the interaction between the users and the service recommendation algorithm. All the procedures, such as service filtering and ranking, are conducted automatically and are purely data driven. Compared with the human agents and the marketplaces maintained individually by service providers, CSM offers a more objective assessment of the services. Moreover, CSM is able to provide comparative information for similar services of different providers, giving the users more substantial information about the pros and cons of the candidate services.

(3) CSM enables a more natural service acquisition approach. CSM provides an iterative conversational service acquisition approach. This approach enables the customers to depict their requirements in a more flexible way and helps them to continually narrow down the candidate services based on the known information. Moreover, this approach gives the customers better experiences of acquiring proper services. It overcomes the limitation of keyword-based search on complex search requirements.

As CSM is a comprehensive project developed by people from the areas of data mining, semantic web, information retrieval, service management, and cloud computing, it is not feasible to introduce the entire system in detail. To keep the description relevant to information systems, in this article, we mainly focus on the client-side modules that are more relevant to data mining, semantic web, and information retrieval.

To summarize, the contributions of this article are as follows: (1) we investigated the limitations of the current service acquisition market and proposed a new approach to simplify the information exchange for the market participants; (2) we proposed the system called CSM to facilitate the service acquisition via the techniques of data mining, semantic web, information retrieval, and natural language processing; and (3) to evaluate the effectiveness and the efficacy of CSM, we conducted various experiments and case studies on the system modules with both real and synthetic datasets.

1.4. Roadmap

The rest of the article is organized as follows. We introduce CSM’s features and architecture from a high-level perspective in Section 2. Section 3 elaborates how our service acquisition system helps customers to quickly obtain the desired services via a conversational interaction. Section 4 introduces the core modules of CSM in detail, covering the Service Knowledge Graph, Conversation Parser, Service Ranking, and Dialog Engine. Section 5 reports the system evaluation. The related work is presented in Section 6. Finally, Section 7 concludes our article.



Fig. 1. CSM architecture.

2. SYSTEM OVERVIEW

2.1. System Features

CSM is more than a traditional marketplace that only focuses on advertising and selling services. It is a data-driven system that offers a rich set of features to facilitate the cloud service acquisition, including the following key features:

—CSM supports user-friendly conversational interactions. In CSM, service acquisition is conducted via a natural language conversational interface. This approach allows the customers to elaborate on their complex requirements in an iterative way. In each iteration, CSM leverages NLP and text mining techniques to understand the customers’ intention and incrementally determines their potential requirements via their input.

—CSM enables an efficient service acquisition by simultaneously filtering, ranking, and configuring candidate services. Filtering and ranking are useful procedures to help the customers quickly discover the services they need, while configuring is a necessary step for the service customers to obtain proper customized services. To reduce the service acquisition time, CSM leverages a novel candidate filtering algorithm to eliminate as many candidate services as possible in each iteration, following the minimal efforts policy. Moreover, an advanced service ranking method is leveraged to provide the domain-specific service ranking based on the heterogeneous service knowledge graph. By profiling the configuration parameters during filtering, CSM partially avoids the redundant steps for service configuration.

—CSM has rich knowledge about its IT services. To better “understand” customers’ requirements, CSM leverages semantic web technology to describe service concepts. The service concepts and their relationships are modeled in the form of an ontology and are stored in a service knowledge graph. By following the ontology, we transform the available information into a heterogeneous knowledge graph. Currently, the knowledge graph is mainly populated by the crawler that collects the metadata for 2,450 services from the service websites. In the future, the registration will be opened to service vendors and they can register their new services to CSM.

2.2. System Architecture

The high-level architecture of CSM is illustrated in Figure 1. In general, CSM can be divided into three parts: Service Customer Module, Service Provider Module, and



Service Marketplace Module. As we focus on discussing service acquisition in this article, we mainly introduce the Service Customer Module and Service Marketplace Module in detail.

The Service Marketplace Module serves as the core of CSM. It maintains and provides functionalities to manipulate the service metadata such as the knowledge data, the parsed requirement, the ranking results, and the service request data. This module includes Service Knowledge Graph, Semantic Query Engine, Service Ranker, Service Configurator, and Service Registrar. Each item is in charge of one particular set of tasks such as knowledge storage, knowledge retrieval, service ranking, candidate service configuration, and new service profile registration, respectively. The Service Customer Module handles the interaction with the customers, the parsing of the requirements, and the guidance of the conversation, as supported by the Conversational Interface, Conversation Parser, and Dialog Engine, respectively.

3. CONVERSATIONAL SERVICE ACQUISITION

The major distinction between CSM and traditional online marketplaces is how CSM interacts with the customers. As previously mentioned, CSM’s methodology is iterative requirement elicitation. This design is necessary because most IT services are complex and require a lengthy description and sophisticated service configuration/customization. Given that neither keyword search nor faceted search is applicable, an intuitive way is to let the customers describe their requirements in natural language and leverage the incremental elicitation to conduct service acquisition. Concretely, the customers do not need to provide all requirements at once. They only need to provide partial or high-level requirements at first. Then, based on the feedback from CSM, the users can provide more details iteratively.

Two practical considerations make this method a good choice: (1) The feedback from the system can inspire the customers and help them make the correct decision. Due to the complexity of service requirements, it is not easy for the customers to organize their description succinctly, clearly, and completely from the first interaction with a system. In the real world, it is not easy for people to express what they need at the very beginning. But it is easier for people to answer a yes/no question if the system can ask such questions. (2) Inferring customers’ intention in an interactive approach reduces the risk of misunderstanding customers’ complex requirements. It is always easier and more accurate to parse and handle simple sentences than complex paragraphs. State-of-the-art natural language processing (NLP) techniques still have difficulty in precisely understanding the meaning of a long paragraph with complex syntax and semantics. Moreover, as CSM is expected to respond to customers in real time, even an NLP algorithm that is capable of understanding complex requirements may not be practical for CSM due to the long processing time.

Figures 2, 3, and 4 demonstrate how a customer interacts with CSM via the Conversational Interface. In CSM, the user interface is divided into two parts: (1) the conversation area, which includes the conversation display area and the text input area, and (2) the candidate service list area, which displays the ranked candidate services selected based on the conversation. As shown in the conversation area, the customer intends to find a virtual service solution, and thus he or she tells CSM, “I need a virtual server.” After understanding the requirement, CSM finds and displays two matching service categories, “Virtual Infrastructure” and “Storage,” and waits for the customer’s next input (Figure 2). Afterward, the user asks CSM to display all the related services. Accordingly, CSM retrieves all the relevant services and displays them on the right side according to their ranking score, which will be introduced in Section 4.4. To further filter the candidate services, CSM provides the customers with extra questions and waits for the feedback to rule out unqualified candidates and conduct service configuration.



Fig. 2. Screenshot: service retrieval.

Fig. 3. Screenshot: service filtering.

Through a series of iterative questioning-answering procedures (see Figure 3), CSM finally learns the real intention of the customers and discovers the matched service (see Figure 4).

4. CORE MODULES

This section presents the details of the core modules of CSM.

4.1. Service Knowledge Graph

Obtaining the service knowledge is the foundation to enable CSM to conduct service acquisition. In CSM, we maintain a Service Knowledge Graph (SKG). To



Fig. 4. Screenshot: service configuration.

Fig. 5. Ontology of services knowledge.

appropriately represent the knowledge, we talked with IT service domain experts and defined an ontology to model the entities and relations. Following the schema of the ontology, we represent the knowledge as a heterogeneous graph.

Compared with Entity-Relationship [Chen 1976]-based knowledge modeling (the way people store the data in a relational database), an advantage of modeling the knowledge with a graph and the relationships with an ontology is that more knowledge can be expressed with less data. This is because the extra information can be inferred according to the reasoning rules defined in the ontology. Moreover, by modeling the knowledge with a heterogeneous graph, a lot of graph mining algorithms can be directly used to facilitate the task of service acquisition.

Figure 5 shows part of the ontology we defined with the domain experts (the whole ontology is too complex to show). Some of the key concepts are explained as follows:



—Service is the central concept in service ontology. In our design, each Service is only a container; it contains one or more ServiceFunctions and only belongs to one ServiceProvider. It also has several basic features like Name and Description.

—Each ServiceFunction contains one or more Capabilities and Configs. Moreover, a ServiceFunction would belong to only one FunctionCategory, which is a subclass of Category. In our design, a service is categorized indirectly by the ServiceFunction it contains. This is because any ability/functionality it has is supported by a concrete ServiceFunction. Therefore, it is natural to categorize the services by the ServiceFunction.

—Config records the configuration information of a ServiceFunction. There are three subclasses of Config: ConfigGroup, ConfigNumeric, and ConfigText. ConfigGroup provides a collection of choices for configuration. The customer may select zero, one, or several options based on the corresponding SelectionType. ConfigNumeric allows the customer to enter a number to configure the service. For example, the number of users allowed to simultaneously use the service is a kind of ConfigNumeric. ConfigText allows the customer to enter a piece of text to configure the service, such as the Uniform Resource Identifier (URI) or user name for an account.

—Each ServiceFunction can depend on one or more ServiceFunctions or FunctionCategories. Taking the web server as an example, Microsoft’s Internet Information Services (IIS) depends on a particular ServiceFunction named Microsoft Management Console, a function that belongs to the service Windows Virtual Machine. Another web server, Apache Tomcat, instead of depending on a concrete ServiceFunction, depends on a FunctionCategory called Virtual Server. Besides the entities, there are mainly three subrelationships of dependsOn: requires, optionalDependsOn, and canBeReplacedBy. Their meanings can be explained as follows: Suppose service function A depends on B. If A requires B, then A cannot work without B. optionalDependsOn illustrates that one of the extensible capabilities of A depends on B. If the extensible capability needs to be activated, then B is necessary. canBeReplacedBy illustrates that one of the submodules of A can be replaced by B, but the replacement is not mandatory.

—We leverage WordNet [Miller 1995], a semantic database for English, to tag the FunctionCategory. The tags will be used during the conversation parsing and can help CSM to quickly locate the semantically related FunctionCategory based on the input. More details about the usage of WordNet tags will be introduced in Section 4.2.

We also give an example heterogeneous graph in Figure 6 to better explain the defined service ontology. As shown in Figure 6, the service IBM Smart Cloud Enterprise [IBM Smart Cloud Enterprise Plus 2009] is providedBy IBM Corporation and contains IBM Virtual Server. IBM Virtual Server belongs to the category Virtual Infrastructure Category. Moreover, this service contains two Capabilities, including Fast on-demand access to secure virtual server and Enterprise class private cloud, and three Configs, including VM (Virtual Machine) count, Guest OS (Operating System), and Storage Strategy. For the three Configs, VM count is in type ConfigNumeric with Min set to 0 and Max set to 10,000. Guest OS and Storage Strategy are both in type ConfigGroup and each contains several configuration options.
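The ontology fragment and the IBM example above can be sketched as plain data classes. This is only an illustrative model, not the representation CSM uses (the article stores the knowledge as a heterogeneous graph): the field names, the `accepts` helper, the `selection_type` encoding, and the placeholder option labels are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Config:
    """Base class for a configuration item of a ServiceFunction."""
    name: str


@dataclass
class ConfigNumeric(Config):
    """Numeric configuration bounded by Min and Max."""
    min_value: float = 0
    max_value: float = 0

    def accepts(self, value: float) -> bool:
        # Hypothetical helper: checks a customer-entered number against the bounds.
        return self.min_value <= value <= self.max_value


@dataclass
class ConfigGroup(Config):
    """A collection of choices; how SelectionType is encoded is assumed."""
    options: List[str] = field(default_factory=list)
    selection_type: str = "single"  # assumed values: "single" or "multiple"


@dataclass
class ConfigText(Config):
    """Free-text configuration such as a URI or an account user name."""
    value: Optional[str] = None


@dataclass
class ServiceFunction:
    name: str
    category: str  # a ServiceFunction belongs to exactly one FunctionCategory
    capabilities: List[str] = field(default_factory=list)
    configs: List[Config] = field(default_factory=list)


@dataclass
class Service:
    name: str
    provider: str  # a Service belongs to exactly one ServiceProvider
    functions: List[ServiceFunction] = field(default_factory=list)


# The Figure 6 example; option labels are placeholders, not taken from the article.
vm_count = ConfigNumeric("VM count", min_value=0, max_value=10000)
guest_os = ConfigGroup("Guest OS", options=["option A", "option B"])
storage = ConfigGroup("Storage Strategy", options=["option A", "option B"])

virtual_server = ServiceFunction(
    name="IBM Virtual Server",
    category="Virtual Infrastructure Category",
    capabilities=[
        "Fast on-demand access to secure virtual server",
        "Enterprise class private cloud",
    ],
    configs=[vm_count, guest_os, storage],
)

service = Service(
    name="IBM Smart Cloud Enterprise",
    provider="IBM Corporation",
    functions=[virtual_server],
)
```

In this sketch the Capabilities and Configs hang off the ServiceFunction rather than the Service, following the ontology's rule that a Service is only a container.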

4.2. Conversation Parser

Instead of leveraging complex NLP techniques, the customer’s input is parsed in a quick and simple way in CSM. Three reasons lead us to make this decision: (1) The iterative style of service acquisition reduces the semantic complexity of the input. In an iterative conversation, the average length of a customer’s input is decreased, since the customers do not need to input all their requirements at once. Therefore, the semantic complexity of the input is reduced. (2) The customers have no motivation to



Fig. 6. Example knowledge related to IBM Smart Cloud Enterprise Plus.

input complex sentences. The goal of the customers is to find proper services instead of challenging CSM, so there is no motivation to intentionally input confounded sentences to trouble the system. (3) It is not practical to apply complex NLP parsing in real-time systems. As mentioned before, CSM needs to interact with the customers in real time. It is not reasonable to put too much effort into advanced NLP parsing that requires more time.

In the current version, CSM makes use of the words related to service categories and capabilities. Based on the assumption that the users are unlikely to input complex compound sentences containing double negatives, subjunctives, or if-else constructs, CSM treats the inputs as directive sentences. Concretely, the parsing can be described in the following steps: (1) Extract verbs and nouns from sentences. (2) Remove stop words. (3) Conduct stemming and lemmatization for the remaining words. (4) Divide words into include and exclude sets based on the grammar structure in the sentence. Each word in a set is called a seed word. (5) Propagate these two sets with WordNet. (6) Find the corresponding category/capability via WordNet tag annotation.

To better understand how the sentences are processed, we give an example input from a customer as follows: I need to pay my employee, and I want the service to generate the reports. Based on this sentence, the following words would be extracted: need, pay, employee, want, service, generate, reports. Then three stop words, need, want, and service, would be removed. The last word, service, is treated as a stop word because it is too common in the realm of IT services. Since there is no negation in the sentence, all the remaining words (pay, employee, generate, reports) are put into the include set. Thereafter, we find the associated synsets (including the synonyms, hypernyms, meronyms, etc.) of each word via WordNet to propagate the set.
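The worked example can be reproduced with a deliberately simplified sketch. It is not the CSM parser: part-of-speech extraction, stemming, and lemmatization are omitted, the stop-word and negation lists are assumptions, and `TOY_SYNSETS` is a hypothetical stand-in for WordNet lookups.

```python
import re

# Illustrative stop-word list; "service" is included because the text treats
# it as too common in the IT-service domain. The real list is not given.
STOP_WORDS = {"i", "to", "my", "and", "the", "a", "need", "want", "service"}

# Assumed negation cue words used to route seed words into the exclude set.
NEGATIONS = {"not", "no", "without", "except"}

# Tiny hypothetical stand-in for WordNet synsets.
TOY_SYNSETS = {
    "pay": {"payroll", "salary"},
    "reports": {"report", "reporting"},
}


def parse(sentence):
    """Split a sentence into include and exclude sets of seed words."""
    include, exclude = set(), set()
    # Split into clauses so that a negation only affects its own clause.
    for clause in re.split(r",|\band\b|\bbut\b", sentence.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        negated = any(t in NEGATIONS or t.endswith("n't") for t in tokens)
        seeds = {
            t for t in tokens
            if t not in STOP_WORDS
            and t not in NEGATIONS
            and not t.endswith("n't")
        }
        (exclude if negated else include).update(seeds)
    return include, exclude


def propagate(seeds):
    """Expand seed words with related words from the toy synset table."""
    expanded = set(seeds)
    for word in seeds:
        expanded |= TOY_SYNSETS.get(word, set())
    return expanded


include, exclude = parse(
    "I need to pay my employee, and I want the service to generate the reports."
)
expanded_include = propagate(include)
```

For the example sentence, `include` holds pay, employee, generate, and reports, and `exclude` is empty, matching the description above. A real implementation would replace the stop-word filter with verb and noun extraction and the toy table with WordNet queries.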

Finally, a semantic query is generated to retrieve all related Function Categories and Capabilities that are annotated with the words in the include set. In principle, querying the semantically matched entities is a subgraph mining procedure [Zou et al. 2011]. As semantic inference is time-consuming, we leveraged the parallel inference engine [Urbani et al. 2012] to materialize all the inferred knowledge offline. This strategy increases the storage cost but significantly reduces the processing time.

4.3. Dialog Engine

The Dialog Engine is the centerpiece that guides the conversation, coordinates input processing, and executes back-end knowledge discovery tasks. Figure 7 illustrates how the dialog engine works with other modules in one iteration of conversation. For each



Fig. 7. Workflow of the dialog engine.

user session, Dialog Engine assigns a thread to handle the user’s requests through the entire service acquisition procedure. The metadata of this procedure will also be recorded, including the current state of the procedure, the information about the requests, the qualified service candidates, and so forth. At the very beginning of the service acquisition procedure, the users tell CSM about their requirements via the conversational interface. Afterward, the Conversation Parser would parse the input, transform the result into predefined semistructured data, and send it to Dialog Engine. Upon receiving the parsed data, Dialog Engine would conduct corresponding actions based on the conversation context. If necessary, Semantic Query Engine will be invoked to retrieve the needed knowledge from Service Knowledge Graph via a semantic query. After receiving the retrieved data, the Dialog Engine would update the metadata as well as the conversation context to continue the conversation. Finally, Dialog Engine would send a response back to Conversation Parser, which will reversely parse the response back into natural language using the metadata stored and the predefined text templates. Two main tasks for Dialog Engine include conversation flow control and effective services filtering. The two tasks will be discussed in detail in Sections 4.3.1 and 4.3.2, respectively.

4.3.1. Conversation Flow Control. Since CSM is used as the service acquisition portal, the conversation between the customers and CSM should be controlled in the context of service acquisition. Otherwise, the conversation can gradually drift off topic.

To control the conversation, we defined the logic flow with the IT services domain experts. The logic flow is modeled as a directed graph G(V, E), where vertices denote the context states and edges denote the transitions between contexts. Figure 8 shows the main part of the logic flow topology, with eight context states (along with several exception handling states) and 13 transitions. Logically, CSM uses a finite state machine (FSM) to represent the information of the conversation logic flow. The FSM enables us to flexibly update the logic flow if there is a better design.



Fig. 8. Conversation logic flow.

At the beginning of each conversation (recognized as the creation of a new session), Dialog Engine creates the profile for the current session and sets the current state to Service Category Identification. In this state, CSM assumes that the customer’s input is related to the scenario of looking for proper service categories. If CSM cannot find useful information from the input regarding the service categories, it would enter a corresponding exception handling state and prompt the customer to input again. In principle, the workflow of how each state processes customers’ (parsed) input is similar, and the only difference is the conversation context for the input and the corresponding actions.
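A table-driven finite state machine of this kind can be sketched as follows. Only the initial state, Service Category Identification, and the existence of exception handling states come from the text; the remaining state names, the event names, and the `Session` class are hypothetical, and the sketch covers far fewer than the eight states and 13 transitions of Figure 8.

```python
from enum import Enum, auto


class State(Enum):
    CATEGORY_IDENTIFICATION = auto()  # initial state named in the text
    CATEGORY_EXCEPTION = auto()       # exception handling state (name assumed)
    CAPABILITY_FILTERING = auto()     # assumed name
    CONFIG_FILTERING = auto()         # assumed name
    DONE = auto()                     # assumed terminal state


# Directed graph G(V, E): keys are (current state, event), values are next states.
# Event names are illustrative only.
TRANSITIONS = {
    (State.CATEGORY_IDENTIFICATION, "category_found"): State.CAPABILITY_FILTERING,
    (State.CATEGORY_IDENTIFICATION, "nothing_found"): State.CATEGORY_EXCEPTION,
    (State.CATEGORY_EXCEPTION, "category_found"): State.CAPABILITY_FILTERING,
    (State.CATEGORY_EXCEPTION, "nothing_found"): State.CATEGORY_EXCEPTION,
    (State.CAPABILITY_FILTERING, "capabilities_done"): State.CONFIG_FILTERING,
    (State.CONFIG_FILTERING, "configs_done"): State.DONE,
}


class Session:
    """Per-session conversation context, created when a conversation starts."""

    def __init__(self):
        self.state = State.CATEGORY_IDENTIFICATION

    def handle(self, event):
        # Events with no outgoing edge leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

Keeping the transitions in a table rather than in code is one way to get the flexibility mentioned above: changing the logic flow means editing `TRANSITIONS`, not the engine.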

4.3.2. Interactive Services Filtering. After identifying service categories, a customer would receive a list of candidate services. Usually, there are a large number of candidates and it is impractical for the customer to browse through the whole list, especially when some of the services are not well known. One key feature of CSM is to ask clarification questions to refine customers’ initial requirements. Unqualified services are ruled out through a series of questions and answers regarding service capability and configuration. To make the system effective, CSM needs to ask as few questions as possible to prune as many candidate services as possible.

Treating the services as data labels and the service capability and configuration questions as features, the problem of question generation can be described as follows: given the initial input of a customer, we are looking for a set of features, knowing whose values would reduce the uncertainty of matching results. In each iteration, the undetermined feature with the most distinguishing power will be asked as the next question (e.g., Do you need the capability X?; Please make a choice about the following options: 1. A, 2. B, 3. C.; or Please enter the value about Y.).

Formally, let y be a candidate service, x_i be a feature that corresponds to a question, and v_i be the customer’s answer to the ith question, that is, x_i = v_i. We assume that the question’s answers (i.e., attribute values) are categorical. For features with continuous values, we can discretize them [Han 2005]. Let feat be the set of all features, obs = {x_1 = v_1, . . . , x_k = v_k} be the map of features and their corresponding values (i.e., question answers) that are currently known, keys(obs) be the involved features of obs, and un be the set of the remaining features, that is, un = feat − keys(obs). To measure the uncertainty, we empirically use the entropy of the outcomes given the



known feature values, that is,

H(y|obs) def= −∑y∈Y

Pr(y|obs) log Pr(y|obs). (1)

Here, $H(y \mid obs)$ is the entropy of a service $y$ given a particular $obs$, and $\Pr(y \mid obs)$ can be estimated as the number of candidate services that satisfy $obs$ over the number of all services that satisfy $obs$. Selecting an optimal set of features that minimizes the uncertainty, that is, finding $opt \subseteq un$ to minimize $H(y \mid obs, opt)$, is NP-hard [Karp 1972]. In CSM, we use a greedy selection strategy to pick the next question to probe and continue to ask further questions whenever the user provides the answer to the current question. For each remaining feature $x_i$ with $i \in un$, the uncertainty of $y$ after probing $x_i$ is needed, that is, $H(y \mid obs, x_i)$. The remaining features are then ranked in ascending order of $H(y \mid obs, x_i)$, and the top-ranked feature is used to generate the next question, that is,

$$x_{next} = \arg\min_{i \in un} H(y \mid obs, x_i). \tag{2}$$

However, $H(y \mid obs, x_i)$ cannot be computed directly since the value of $x_i$ is unknown before the probing. To address this issue, we estimate the entropy as the expectation of $H(y \mid obs, x_i)$ over the distribution of $x_i$ instead [Zhu et al. 2008], that is,

$$\hat{H}(y \mid obs, x_i) = \sum_{v_j \in x_i} \Pr(x_i = v_j)\, H(y \mid obs, x_i = v_j).$$

The distribution of $x_i$ is initialized as a uniform distribution and is continuously updated according to the recorded usage log.
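The greedy selection described above can be sketched as follows. This is a minimal illustration rather than the CSM implementation: it assumes that each remaining candidate is equally likely given the observations, uses base-2 logarithms, and represents services as plain dictionaries. The function names, feature names, and sample services are all hypothetical.

```python
import math

def entropy(services):
    """Entropy of Equation (1), assuming Pr(y | obs) is uniform over the
    services that still satisfy obs (an assumption of this sketch)."""
    n = len(services)
    if n == 0:
        return 0.0
    p = 1.0 / n
    return -sum(p * math.log2(p) for _ in range(n))

def expected_entropy(services, feature, value_prob=None):
    """Expected entropy after asking about `feature`.

    `value_prob` maps each answer to its probability; when omitted, a
    uniform distribution over the observed answers is used."""
    groups = {}
    for service in services:
        groups.setdefault(service[feature], []).append(service)
    if value_prob is None:
        value_prob = {value: 1.0 / len(groups) for value in groups}
    return sum(value_prob.get(value, 0.0) * entropy(group)
               for value, group in groups.items())

def next_question(services, unknown_features):
    """Greedy choice of Equation (2): lowest expected entropy wins."""
    return min(unknown_features,
               key=lambda f: expected_entropy(services, f))

# Hypothetical candidates; feature names and values are illustrative only.
candidates = [
    {"name": "svc-a", "ssl": "yes", "tier": "free"},
    {"name": "svc-b", "ssl": "yes", "tier": "free"},
    {"name": "svc-c", "ssl": "no", "tier": "free"},
    {"name": "svc-d", "ssl": "no", "tier": "free"},
]
chosen = next_question(candidates, ["ssl", "tier"])
```

In this toy input every service shares the same `tier`, so asking about it removes no uncertainty, whereas `ssl` splits the candidates evenly and is therefore selected. A distribution learned from usage logs could be supplied through `value_prob` in place of the uniform default.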

Note that our method is similar to maximizing information gain in decision-tree-building algorithms [Quinlan 1986], except that the estimated average entropy is used in CSM (since the feature values are unknown when building the tree) instead of the information gain. Different from active learning, which selects unlabeled data for labeling to improve the classification accuracy [Cohn et al. 1996; Settles 2009], our method is to probe unknown features or ask clarification questions.

In CSM, three types of questions regarding capability, ConfigNumeric, and ConfigGroup are used. We separate the filtering into two states to make the logic flow match the real workflow of service acquisition. As shown in Figure 8, services are first filtered by capability and then configuration. In the following, we illustrate how we pick the next configuration for filtering. Picking the capability is conducted in a similar way.

4.4. Service Ranking

During service filtering, CSM generally returns a list of services that match the customer’s current requirements. Ranking these services and understanding the relative importance of a service provider on certain service products are critical and may have a direct impact on the effectiveness of the acquisition process.

As is known, ranking plays a key role in web search systems to convey the relative importance of web pages. Prevailing ranking algorithms, such as HITS [Kleinberg 1999] and PageRank [Page et al. 1999], as well as their extensions, rank the entities based on the topological links residing in a web page. However, ranking services is more complex than ranking web pages as it involves multiple types of entities and features, including service capability, service configuration, service providers, and so forth [Peng et al. 2012]. Moreover, the link-based relationships embedded in services are well beyond the explicit hyperlinks and involve much more complex interrelated structures. To preserve the relationships among these entities, we store the information of the services in the form of a semantic web (more details are available in Section 4.1), which inherently is in the form of a heterogeneous graph. Due to the complex structures of heterogeneous graphs, none of the previously mentioned ranking algorithms is sufficient to handle the service ranking.

To conduct the service ranking on a heterogeneous graph, we leverage the ServiceRank method we have proposed [Zhou et al. 2013], which makes use of the information about the links among services $S$ and service providers $P$. In the following, we introduce how ServiceRank is applied in CSM. The service graph is represented as $G = (V, A, E)$, where the entities $V = S \cup P$ denote the set of service vertices $S$ and service provider vertices $P$, $A$ denotes the set of feature vertices (such as service capability and service configuration), and $E$ denotes the set of edges between any two types of entities.

Based on the graph link structure, this method first conducts a probabilistic clustering on the graph to cluster the services into $k$ partitions. Then, in each partition, a local ranking is performed according to the heuristic rules.

4.4.1. Probabilistic Service Clustering. In the presence of a heterogeneous graph, ranking the entities (in our scenario, the services) is not as trivial as ranking the web pages in a homogeneous graph. This is because the influence of an entity cannot be easily quantified using the approach for a homogeneous graph [Sun and Han 2013]. In the heterogeneous graph, different types of attributes will have different influences on the neighbors. Moreover, in our scenario, services have the natural property that they belong to various categories, so ranking services with quite different functionalities against one another would be less meaningful. For these reasons, ServiceRank tightly integrates ranking and clustering so that they mutually and simultaneously enhance each other and the performance of both can be boosted. The superiority of ranking with the help of clustering has been shown in the ServiceRank work [Zhou et al. 2013].

To perform service clustering, a random walk distance matrix $R$ needs to be generated according to the random walk with restart model [Tong et al. 2006], as in Equation (3):

$$R = \sum_{i=1}^{l} c(1 - c)^{i}\, T^{i}. \tag{3}$$

The matrix $R$ contains the information about how cohesive two entities are, from the perspective of the attributes’ relationship in the heterogeneous graph. In this perspective, the cohesiveness of two entities is measured as the probability that one entity can visit the other one via a random walk with restart. The probability reflects the chance that two entities belong to the same cluster.

In Equation (3), $l$ denotes the number of steps that a random walk can perform, $c$ denotes the restart probability, and $T$ denotes the transition probability matrix defined in Equation (4):

$$T = \begin{bmatrix} T_{SS} & T_{SP} & T_{SA} \\ T_{PS} & T_{PP} & T_{PA} \\ T_{AS} & T_{AP} & T_{AA} \end{bmatrix}. \tag{4}$$

In the transition probability matrix $T$, $T_{XY}$ is an $|X| \times |Y|$ matrix whose entries are the transition probabilities between an entity of type $X$ and an entity of type $Y$.
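The truncated sum in Equation (3) can be sketched directly, given a transition matrix. The following is a minimal pure-Python illustration under the assumption that $T$ is supplied as a square list of lists; the function names and the two-vertex example are hypothetical and are not taken from the CSM implementation.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rwr_matrix(T, c, l):
    """Random walk distance matrix R = sum_{i=1}^{l} c (1 - c)^i T^i."""
    n = len(T)
    R = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in T]  # holds T^i, starting with T^1
    for i in range(1, l + 1):
        coeff = c * (1.0 - c) ** i
        for r in range(n):
            for s in range(n):
                R[r][s] += coeff * power[r][s]
        power = mat_mul(power, T)
    return R

# Toy example: two vertices that always step to each other.
T_toy = [[0.0, 1.0],
         [1.0, 0.0]]
R_toy = rwr_matrix(T_toy, c=0.5, l=2)
```

With `c = 0.5` and `l = 2`, the one-step term contributes 0.25 to the off-diagonal entries and the two-step term contributes 0.125 to the diagonal, so longer walks are discounted geometrically by the restart probability.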

The transition probability is quantified as

$$T_{XY} = \begin{cases} \alpha_{XY}, & \exists\, v_i, v_j \in V \text{ such that } e_{ij} \in E \\ 0, & \text{otherwise} \end{cases}. \tag{5}$$

In Equation (5), $\alpha_{XY}$ denotes the weight between vertex type $X$ and type $Y$ (e.g., a weight between service and service capability). The weight is determined using the dynamic weight tuning method [Zhou and Liu 2011], which is specifically designed for heterogeneous graphs.
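To make Equation (5) concrete, the sketch below fills a transition matrix from a typed edge list: an entry receives the type-pair weight when an edge exists and stays zero otherwise. It assumes undirected edges and takes the weights as given; in CSM they come from the dynamic weight tuning method, which is not reproduced here. The vertex names, weight values, and the `build_transition` helper are hypothetical.

```python
def build_transition(vertices, edges, alpha):
    """Build T from typed vertices and edges.

    vertices: list of (name, type) pairs, type in {"S", "P", "A"}
    edges:    list of (name, name) pairs, treated as undirected
    alpha:    dict mapping (type_x, type_y) to the type-pair weight
    """
    index = {name: k for k, (name, _) in enumerate(vertices)}
    vtype = dict(vertices)
    n = len(vertices)
    T = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        # Fill both directions because edges are assumed undirected.
        for a, b in ((u, v), (v, u)):
            T[index[a]][index[b]] = alpha[(vtype[a], vtype[b])]
    return T

# Hypothetical graph: one service, one provider, one capability.
vertices = [("s1", "S"), ("p1", "P"), ("a1", "A")]
edges = [("s1", "p1"), ("s1", "a1")]
alpha = {("S", "P"): 0.6, ("P", "S"): 0.6,
         ("S", "A"): 0.4, ("A", "S"): 0.4}
T_example = build_transition(vertices, edges, alpha)
```

Rows and columns follow the order of `vertices`, so grouping vertices by type reproduces the block layout of Equation (4).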

Leveraging the random walk distance matrix, the services can be clustered via service influence propagation. The propagation is conducted in a way similar to the heat diffusion process [Kempe et al. 2003]. The theory of estimating the transition probability and conducting the influence propagation is available in our earlier work [Zhou et al. 2013].

4.4.2. Local Service Ranking. Once the clustering results are obtained, the local ranking can be performed accordingly. In general, two heuristic rules give us initial ideas:

—Highly ranked services are likely to have connections with other highly ranked services in the same cluster.

—Highly ranked services are likely to be provided by highly ranked providers in the same cluster.

The service ranking can be obtained by applying ServiceRank, which is an iterative algorithm built on top of the heat diffusion process theory. Initially, the ranking score $sr^{(0)}(i, j)$ of the $i$th service in the $j$th cluster is defined as

$$sr^{(0)}(i, j) = \begin{cases} \sum_{l=1,\, l \neq i,\, sc(l, j) > 0}^{|S|+|P|} H(i, l), & \text{if } sc(i, j) > 0 \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $H$ is the heat diffusion kernel and $sc(x, y) \in [0, 1]$ denotes the probability that entity $x$ belongs to cluster $y$.

The heat diffusion kernel $H$ is an $(|S| + |P|)$-dimensional square symmetric matrix that captures the influence of services (as well as other types of entities) through direct and indirect edges, that is,

$$H = \begin{bmatrix} R_{SS} & R_{SP} \\ R_{PS} & R_{PP} \end{bmatrix}, \tag{7}$$

where $R_{XY}$ is the submatrix containing the pairwise distances between the services ($S$) and the service providers ($P$).

The pairwise distances in $R_{XY}$ are quantified using the random walk with restart model [Tong et al. 2006]. Concretely, the distance between the $i$th and the $j$th entity is quantified as the sum of the reachable probabilities of all the paths between them, that is,

$$R(i, j) = \sum_{\substack{\tau : v_i \leadsto v_j \\ len(\tau) < l}} p(\tau)\, c(1 - c)^{len(\tau)}, \tag{8}$$

where $\tau$ denotes a path between entities $i$ and $j$, $c$ denotes the restart probability, $len(\tau)$ denotes the length of path $\tau$, and $p(\tau)$ denotes the transition probability quantified through random walk.
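Equation (8) can also be read operationally as an enumeration of paths. The sketch below is one possible reading, under the assumptions that a path may revisit vertices (i.e., it is a walk), that the zero-length path is excluded, and that $p(\tau)$ is the product of the transition probabilities along the path. The function name `rwr_distance` and the toy matrix are hypothetical.

```python
def rwr_distance(T, i, j, c, l):
    """Sum p(tau) * c * (1 - c)^len(tau) over walks from i to j
    whose length is at least 1 and strictly less than l."""
    total = 0.0
    # Each stack entry is (current vertex, walk length, walk probability).
    stack = [(i, 0, 1.0)]
    while stack:
        node, length, prob = stack.pop()
        if node == j and length > 0:
            total += prob * c * (1.0 - c) ** length
        if length + 1 < l:
            for nxt, t in enumerate(T[node]):
                if t > 0:
                    stack.append((nxt, length + 1, prob * t))
    return total

# Toy example: two vertices that always step to each other.
T_pair = [[0.0, 1.0],
          [1.0, 0.0]]
d01 = rwr_distance(T_pair, 0, 1, c=0.5, l=3)
d00 = rwr_distance(T_pair, 0, 0, c=0.5, l=3)
```

The enumeration grows exponentially with the length bound on dense graphs, so it is only meant to illustrate the definition; accumulating matrix powers is the practical route.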

The calculation of the final ranking scores is conducted as follows:

(1) Normalize each $sr^{(0)}(i, j)$ in Equation (6) so that $\sum_{l=1}^{|S|+|P|} sr^{(0)}(l, j) = 1$. In this case,

$$sr^{(0)}(i, j) = \frac{sr^{(0)}(i, j)}{\sum_{l=1}^{|S|+|P|} sr^{(0)}(l, j)}.$$

(2) Iteratively update the ranking scores according to

$$sr^{(t)}(i, j) = \frac{sr^{(t)}(i, j)}{\sum_{l=1}^{|S|+|P|} sr^{(t)}(l, j)}.$$

(3) By iteratively updating the ranking scores until convergence, the final ranking scores of the $j$th cluster can be calculated as $sr^{(t)}(:, j)$, where $sr^{(t)}(:, j)$ denotes the ranking score of each service in its cluster $j$ at iteration $t$.
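The initialization in Equation (6) and the normalization in step (1) can be sketched as follows. This covers only the starting point of the procedure, not the iterative refinement, and it assumes the kernel $H$ and the membership probabilities $sc$ are given as lists of lists. The sample values and function names are hypothetical.

```python
def initial_scores(H, sc, j):
    """Equation (6): for each entity in cluster j, sum the kernel values
    to every other entity that also has positive membership in cluster j."""
    n = len(H)
    scores = []
    for i in range(n):
        if sc[i][j] > 0:
            scores.append(sum(H[i][l] for l in range(n)
                              if l != i and sc[l][j] > 0))
        else:
            scores.append(0.0)
    return scores

def normalize(scores):
    """Step (1): rescale the scores of one cluster so they sum to 1."""
    total = sum(scores)
    if total == 0:
        return scores[:]
    return [s / total for s in scores]

# Hypothetical symmetric kernel over three entities and two clusters.
H_example = [[0.0, 0.2, 0.1],
             [0.2, 0.0, 0.3],
             [0.1, 0.3, 0.0]]
sc_example = [[1.0, 0.0],
              [0.8, 0.2],
              [0.0, 1.0]]
sr0 = normalize(initial_scores(H_example, sc_example, j=0))
```

In this example only the first two entities have positive membership in cluster 0, so the third entity receives a score of zero and the other two share the normalized mass equally.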



The refinement procedure can be intuitively explained from the perspective of the heat diffusion process. During the influence propagation, the rankings of the services are continuously updated by the rankings of their neighbor entities (service or provider). When the whole system reaches equilibrium, the final ranking results of these services are determined by both the ranking scores of the services and providers in the same cluster.

4.4.3. System Integration. In CSM, service ranking is conducted offline, once a day. There are mainly two reasons for doing so:

(1) Service ranking costs time. It is not possible to conduct ranking on the fly due to the scale of the service knowledge graph. The efficiency of ranking the services will be discussed in detail in Section 5.2.

(2) The service knowledge is relatively stable, so there is no need to repetitively conduct service ranking. Typically, the provided services are stable. From the marketplace perspective, it is not likely that the composition of services would change dramatically every hour.

The service ranking scores will be stored in a particular data repository. During the questioning-answering procedure, when the candidate services are retrieved, their associated ranking scores will be retrieved as well. The service ranking score is not the only factor that decides the actual ranking. As the service categories defined in the service knowledge graph and the underlying clustering results used in ServiceRank are not perfectly matched, the services in one category may belong to different clusters. To uniformly rank the services, the similarity between the specified service category and the cluster is calculated and combined with the service ranking score.

Suppose the chosen service category $C = \{s_1, s_2, s_3, \ldots, s_n\}$ contains $n$ services and these services belong to $m$ distinct clusters $sc_1, sc_2, \ldots, sc_m$. The similarity between $C$ and $sc_j$ is quantified using the Jaccard similarity [Pang-Ning et al. 2006], that is, $J(C, sc_j) = \frac{|C \cap sc_j|}{|C \cup sc_j|}$. The final ranking scores of the services with respect to the chosen service category are quantified as $rs(C) = J(C, sc_j)\, sr(i, j)$, where $i$ is the index of the service in its cluster and $j$ is the index of the cluster.
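The category-aware score can be illustrated with a short sketch that multiplies the Jaccard similarity between a category and a cluster by a stored ServiceRank score. The service identifiers, the score value, and the function names below are hypothetical and serve only to show the arithmetic.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def final_score(category, cluster, sr_score):
    """Weight a service's cluster-local score by how well its cluster
    overlaps with the chosen service category."""
    return jaccard(category, cluster) * sr_score

# Hypothetical category and cluster sharing two of four services.
category = {"s1", "s2", "s3"}
cluster = {"s2", "s3", "s4"}
score = final_score(category, cluster, sr_score=0.4)
```

Here the category and the cluster share two of four distinct services, so the similarity is 0.5 and a local score of 0.4 is scaled down to 0.2. A service in a cluster that overlaps poorly with the chosen category is thereby demoted even if it ranks highly inside its own cluster.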

5. SYSTEM EVALUATION

In this section, we focus on the service filtering and ranking aspects of CSM. The evaluation of the aspects of service deployment performance, virtualization management, and system fault tolerance is beyond the scope of this article. To demonstrate the effectiveness and efficacy of CSM, we evaluate the system on both synthetic and real datasets. First, we introduce the real dataset used in the CSM system and describe how we populated the knowledge graph. Second, we demonstrate the effectiveness of the service ranking module by presenting the experiment results on both real and synthetic datasets. Finally, we present a case study to demonstrate the efficacy and effectiveness of CSM.

5.1. The Service Knowledge Graph

To populate the knowledge graph, we implemented a crawler to grab the service profiles from target websites.¹ Then, for each service function, we leveraged WordNet [Miller 1995; Fellbaum 1998] to tag a set of semantically related words to FunctionCategory, using hypernym, meronym, and holonym.

¹ There are mainly two sources for the service data. The first one is the API provided by the service providers. The second one is the third-party web service repository, which gathers the information of the service documents and APIs and converts them in a standard way, and then releases their own API and documents. Both of the two sources provide the standard interface that allows the program to fetch the data.

Fig. 9. Entity distributions in SKG.

Currently, SKG contains 67 distinct service categories and 2,450 services. These services are provided by 928 service providers. Moreover, 14,643 distinct service capabilities are used to describe these services. The statistics show that SKG covers a wide range of areas and a large number of services. To investigate how the services are distributed, we group the services by categories and service providers, and then rank them by the counts in descending order (see Figures 9(a) and 9(b)). We also list the top categories and service providers in Tables I and II, respectively. As shown in both the figures and tables, the services are not uniformly distributed among categories or service providers, as a large portion of the services belong to a small portion of categories or are provided by a couple of large service providers. In detail, we find that less than 5% of service providers provide five-plus services, while the other 90%+ providers have only one or two services each. Such a distribution of services makes search by provider efficient, since once the service provider is determined, the corresponding services are likely to be quickly located. However, in practice, customers would either directly search the name of the service or search for a proper service by providing the requirement (the scenario we address in this article). Searching for a service via the service provider’s name is not typical.

Due to the complexity of the service profile, the large number of services in each service category makes the attempt of finding a proper service by searching the category difficult. Moreover, if each service is classified into multiple categories, or if the customer cannot well describe what category of service he or she is interested in, the difficulty of locating proper services by browsing would be further increased.

5.2. Service Ranking Evaluation

To evaluate the performance of the service ranking module, we conducted extensive experiments on both the real and synthetic datasets. Two reasons drive us to include a synthetic dataset in our experiment: (1) the scale of the real service dataset we currently have is not large enough and (2) we cannot freely control the characteristics of the topology of the Service Knowledge Graph in the real dataset.



Table I. Top Service Categories

| Name | # of Services |
|---|---|
| Financial | 50 |
| Search | 50 |
| Storage | 50 |
| Enterprise | 50 |
| Payment | 50 |

Table II. Top Service Providers

| Name | # of Services |
|---|---|
| Google | 28 |
| Amazon | 17 |
| Microsoft | 11 |
| IBM | 11 |
| AOL | 10 |

5.2.1. Synthetic Dataset. We implemented a knowledge graph generator modified from the Berlin SPARQL Benchmark (BSBM) data generator [Bizer and Schultz 2009]. Using this data generator, we created a synthetic service knowledge graph with 10,000 Services, 3,628 Providers, and 500 service Capabilities. For each Service, there is one service Provider, and the average number of capabilities is set to five. In total, this service knowledge graph is a heterogeneous graph with three kinds of entities, 14,128 vertices, and 246,161 triplets.

5.2.2. Comparison Methods. For a systematic evaluation, we compare ServiceRank with three widely used graph ranking algorithms: HITS [Kleinberg 1999], PageRank [Page et al. 1999], and SimRank [Jeh and Widom 2002]. To make these three algorithms work properly on the heterogeneous service knowledge graph, we treat the nodes of Service, Provider, and Capability as the same type. Additionally, we pick k-means++ as the clustering method for these three ranking algorithms, as k-means++ is guaranteed to find a solution that is, in expectation, $O(\log k)$-competitive with the optimal k-means solution [Arthur and Vassilvitskii 2007].

5.2.3. Ranking Quality Evaluation. To illustrate the ranking quality, the average ranking scores of the top-k services in each cluster of the BSBM dataset and IBM dataset are plotted in Figures 10(a) and 10(b), respectively. The ranking scores are calculated according to the method introduced in Section 4.4. The goal of this evaluation is not to show whether the average ranking score of ServiceRank is higher or lower than those of the counterparts, but to show that the service clusters obtained via ServiceRank are more cohesive than those of the other algorithms. The cohesiveness is reflected by the degree of change in the average score as k changes. As is shown, the ranking score curve of ServiceRank is more stable than any of the curves generated by the alternative algorithms. In particular, the change of the scores for ServiceRank is almost negligible when k is between 20 and 50. This is because ServiceRank clusters the services in the service knowledge graph well, as it is able to maximize the intracluster similarity and minimize the intercluster similarity. Furthermore, the iterative ranking refinement mechanism makes the formed clusters more cohesive.

The ranking quality of PageRank and HITS is worse because they rank the services without making use of the heterogeneous information. As for SimRank, it does not consider the relationship between the scores of consecutive iterations. However, its ranking quality is better than that of PageRank and HITS as it iteratively refines the pairwise correlation between two adjacent vertices.

5.2.4. Ranking Efficiency. The efficiency of service ranking is quantified by the running time. Figures 11(a) and 11(b) show the running time for ranking the services in the BSBM dataset and IBM dataset, respectively. As is shown in this experiment, for both datasets, ServiceRank took the longest running time to calculate the ranking results. It is about 1.05 to 1.16 times slower than SimRank, which is in turn 1.5 to 4 times slower than HITS and PageRank. This is because ServiceRank needs to iteratively compute the random walk distance matrix from scratch on the service knowledge graph. Moreover, the service probabilistic clustering and ranking are conducted in an iterative approach, which requires a considerable amount of running time. SimRank has moderate speed as it also involves an iterative procedure to calculate the pairwise similarity for the services in the service knowledge graph. HITS and PageRank are the most efficient because they only conduct the clustering and ranking once on the service knowledge graph.

Fig. 10. Ranking quality evaluation.

Fig. 11. Ranking quality evaluation.

5.3. Case Study

In this section, we demonstrate several cases of how people interact with CSM. As previously mentioned, people who use the system only need to provide CSM with a brief description of what they need at the beginning, and then CSM would guide the conversation by prompting them with heuristic questions. Since all the conversations are in natural language, no particular training is needed for using this system.

In the case study, the people who use the system issue 10 requests (listed in Table III) in an attempt to find proper services that satisfy their requirements. As a counterexample, one of the requests (No. 10) is meaningless and is included to test the reaction of CSM. Table III lists the sample inputs and the related results about CSM’s responses, including categories, number of service candidates initially found, number of iterations for service filtering, and average response time.

Table III. Inputs and Results of Case Study

| No. | Initial Input | Category Candidates | # Service Candidates | # Iterations | Average Response Time |
|---|---|---|---|---|---|
| 1 | I need an enterprise-level virtual infrastructure. | Virtual Desktop, Virtual Infrastructure, Transportation, Project Management | 6 | 3 | 271ms |
| 2 | I want to pay my employees. | Payroll, Payment | 57 | 5 | 305ms |
| 3 | I want to host a mobile game. | Search, Games, Politics, Education | 191 | 9 | 375ms |
| 4 | Show me the services that allow me to store data in the cloud. | Backup and Recovery, Storage, Transportation, News, Job Search | 53 | 5 | 351ms |
| 5 | I want to arrange a company event. | Events, Music, Travel, Government, Calendar | 148 | 8 | 409ms |
| 6 | I’d like to host an online blog. | Storage, Blog Search, Government | 10 | 4 | 302ms |
| 7 | I’d like to use a word processor online. | Utility, Search, Job Search | 48 | 7 | 707ms |
| 8 | I need to host an online music site. | Portal, Music, Politics | 48 | 6 | 682ms |
| 9 | I need to set up an online voting system. | Game, Government, Search, Calendar | 49 | 5 | 715ms |
| 10 | I need nothing. | No service category matching your requirement is found. | N/A | N/A | 1,189ms |

According to the discovered candidate services, CSM is able to capture the intention of people based on their initial inquiry. This is achieved by the tagging mechanism during knowledge graph population. We also found that there are two problems with the results: some irrelevant results are listed, and the categories are unordered. This is because, to balance the efficiency and the effectiveness, CSM does not apply advanced NLP techniques to the input, but just extracts ALL nouns/verbs and returns ALL service categories with semantically related tags. This approach returns more service candidates than expected and therefore degrades the precision. However, since the following conversation iterations are able to gradually increase precision (we will show this in the next section), this limitation does not degrade the user experience much. Moreover, as service ranking is able to give the more relevant services higher rankings, the system will gradually filter out the less relevant services.

5.4. Filtering Methods Comparison

As we mentioned in Section 4.3.2, CSM leverages the reduction of entropy to minimize the number of conversation iterations (in this section, we call this strategy Entropy-Min). Here, we investigate whether the proposed method is really effective. For comparison, we implemented two other filtering strategies, Random and Particular-First. Random, as its name implies, randomly picks a capability/configuration as the filtering condition. Particular-First picks the most particular capability/configuration (owned by the least number of candidates) as the filtering condition. The intuition is that once people who use the system confirm this capability/configuration, the search space can be significantly reduced.
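The two baseline strategies can be sketched in a few lines. The illustration below assumes that each candidate is described by a set of capability names and that ties in Particular-First are broken alphabetically; the function names, the tie-breaking rule, and the sample data are hypothetical rather than taken from the evaluated implementation.

```python
import random

def particular_first(candidates, unknown):
    """Pick the capability owned by the fewest (but at least one)
    candidates; ties are broken alphabetically in this sketch."""
    counts = {cap: sum(1 for caps in candidates.values() if cap in caps)
              for cap in sorted(unknown)}
    owned = {cap: n for cap, n in counts.items() if n > 0}
    if not owned:
        return None
    return min(owned, key=owned.get)

def random_pick(unknown, rng):
    """Pick any not-yet-asked capability uniformly at random."""
    return rng.choice(sorted(unknown))

# Hypothetical candidates mapped to the capabilities they offer.
candidates = {
    "svc-a": {"ssl", "backup"},
    "svc-b": {"ssl"},
    "svc-c": {"ssl", "backup", "cdn"},
}
unknown = {"ssl", "backup", "cdn"}
rare = particular_first(candidates, unknown)
guess = random_pick(unknown, random.Random(0))
```

In this toy input `cdn` is owned by a single service, so Particular-First asks about it first: a "yes" isolates one service immediately, while a "no" removes only that one candidate.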

To quantitatively compare these methods, we use precision to evaluate their effectiveness. We tried these methods on all the aforementioned nine valid query cases and recorded the corresponding precision values for each iteration. Figures 12(a), 12(b), and 12(c) respectively show the iterative precision for each strategy. Since the dialogs would always lead to at least one matching service in SKG, recall is always 1. The precision is very low at the beginning because CSM initially returns a large number of candidates. To mitigate the randomness of the Random method, we run each query 10 times and report the median number of iterations. From these figures, it is obvious that their behaviors differ in the following aspects:



Fig. 12. Iterative precision for filtering methods.

—Number of iterations. Among the three strategies, the number of iterations for Entropy-Min is far smaller than those of the other two. Even for the case of 191 candidates, Entropy-Min can finish in nine iterations, whereas the other two need 42 and 26 iterations. This is because the question generated by Entropy-Min rules out the largest number of candidates on average.

—Opportunism. Particular-First is purely driven by opportunism. Once the correct capability/configuration is asked, the search space can be significantly reduced. If the picked capability/configuration is unique, the proper service can be immediately located. Otherwise, only the candidates with a particular capability/configuration would be pruned. This mechanism explains why in Figure 12(b) the precision only increases slightly for most of the time and then suddenly reaches the maximum at a certain time point. Random also has the factor of opportunism. If a capability/configuration with a high effectiveness score is picked by chance, precision would significantly increase.

—Scalability. The number of iterations for Random and Particular-First increases linearly as the number of candidates increases, while that of Entropy-Min increases on a $\log_2 N$ scale. This is because the probability of picking the correct particular capability/configuration decreases proportionally as the number of candidates increases. It becomes harder and harder for Particular-First to pick an effective capability/configuration by chance. Random also has difficulty picking an effective question by chance. Different from the other two, Entropy-Min aims to find the capability/configuration that approximately halves the candidates, so the candidates can always be reduced at a logarithmic scale.

To further investigate how fast Entropy-Min reduces the search space, we also compare its effectiveness with service filtering in the ideally expected case where half of the candidate services will be removed in each iteration (we name it Log-Elimination). As known from previous sections, in each step, the system will provide a question and wait for a choice. As shown in Figure 3, the choices are “yes,” “no,” and “don’t care.” Each question can help the user to rule out about half of the services, given the assumption that the service capabilities are uniformly distributed. In this case, it only takes a logarithmic number of steps to rule out all unqualified services in the ideally expected case. To illustrate how Entropy-Min fits Log-Elimination, we plot the number of candidate services in each step for all the examples in the case study in Figure 13. As a comparison, we also include the results for Particular-First. In this experiment, we do not include the results for Random. This is because in each trial, the number of remaining candidate services in each step is different when using Random. We cannot take the average for Random either, as the number of steps in each trial is different.
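The Log-Elimination reference curve is easy to reproduce. The sketch below assumes that each step keeps the ceiling of half the remaining candidates and stops when one service is left; both choices, as well as the function name, are assumptions of this illustration rather than details stated in the article.

```python
import math

def log_elimination_curve(n):
    """Candidates remaining after each ideal halving step, from n down to 1."""
    curve = [n]
    while n > 1:
        n = math.ceil(n / 2)
        curve.append(n)
    return curve

curve_191 = log_elimination_curve(191)
steps_191 = len(curve_191) - 1
```

For 191 initial candidates this gives eight ideal steps, which can be set beside the nine iterations reported for Entropy-Min on query No. 3 in Table III.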



Fig. 13. Service filtering comparison among Entropy-Min, Particular-First, and Log-Elimination cases. For all the subfigures, the yellow line, blue line, and dashed line denote the curves of Particular-First, Entropy-Min, and Log-Elimination, respectively. It is trivial to observe that the curve of Entropy-Min fits better than the curve of Particular-First.

As shown in all the subfigures of Figure 13, the curves of Entropy-Min are almost always well fitted to the curves of Log-Elimination. This is due to the characteristic that Entropy-Min aims to find the service capability that can well reduce the search space, regardless of the choices made for the questions. In contrast, the performance of Particular-First depends on the choice. If the right choice is picked, the search space can be significantly reduced. However, the probability of picking the right choice is low, and it decreases as the number of service capabilities increases. Typically, more candidate services often lead to more distinct service capabilities of these services. The issue of low-probability picking would drive the curve of Particular-First gradually away from the curve of Log-Elimination. This phenomenon is sufficiently reflected in Figures 13(c) and 13(e), as these two cases respectively have 191 and 148 candidate services at the beginning.



The previous experiment results clearly demonstrate the advantages of Entropy-Min during candidate filtering. Although Particular-First occasionally requires fewer iterations by chance (query No. 1 only needs one iteration), it performs worse than Entropy-Min on average. As users tend to remember bad experiences, it is better to use the better-on-average solution.

6. RELATED WORK

6.1. IT Service Ecosystem

To facilitate IT management, several IT ecosystems have been proposed. Previous studies can be grouped into three categories [Akolkar et al. 2012], and two of them are related to our work: (1) service search and composition and (2) service deployment workflow automation. Battle et al. [2005] and Roman et al. [2005] proposed ontology and language as the service metadata in a semantic approach. Lécué and Léger [2006] leveraged AI planning algorithms and applied backward chaining to derive suitable matching and composition from a certain goal. Sebastian et al. [2008] proposed a workflow ontology for collaborative tasks and execution activities. CSM is different from these works in two aspects: (1) The information that can be inferred from the Service Knowledge Graph is more than the metadata it actually stores. In our design, it is also used to store other types of information such as customers’ profiles, service reviews, and so forth. Therefore, the ontology we proposed describes broader concepts than the services. (2) Instead of providing a simple catalog for service search, CSM uses a conversational interface for service retrieval, a more natural way for customers to search and configure services.

6.2. Feedback System

While there is no perfect method to fully capture the user’s intention from the user’s inputs, feedback is an effective way to obtain knowledge about users. Some previous studies have demonstrated the effectiveness of this strategy. The studies of Iwayama [2000], Kelly and Fu [2006], and Tan et al. [2007] leveraged relevance feedback for disambiguation. They provided the relevant information for the users to label products based on their inputs. Kotov and Zhai [2011] proposed sense feedback. Instead of providing the relevant information, their method revealed the ambiguity at the semantic level. The studies of Kotov and Zhai [2012] and Liu et al. [2012b] utilized external knowledge graphs such as WordNet and ConceptNet for query expansion from the semantic perspective. Different from all the foregoing studies, the interactive feedback in CSM occurs at the following two levels: (1) the disambiguation of concepts at the term level during the service identification stage and (2) the search space reduction at the semantic relationship level during the candidate service filtering stage. The first level is similar to what the previous works did. The second level is more focused on knowledge navigation. There are also some systems that leverage feedback (e.g., failure rate and service delay) to improve the service quality [Liu et al. 2012a, 2012c] from the aspect of CDN optimization. In contrast, our work mainly focuses on the service acquisition experience for the service users.

6.3. Data Mining Application on Autonomous System

The increasing complexity and scale of modern systems and services make the management difficulty far beyond the capabilities of human beings. The emergence of autonomous computing techniques relieves the burden on system administrators and service providers. To improve the management efficacy, data mining techniques are leveraged to make the systems more intelligent [Li 2015]. There are mainly two directions of data mining applications on autonomous systems: (1) the system autonomous techniques and (2) the system analytical techniques. The studies in the first direction mainly leverage data mining techniques to support the system self-maintenance. For example, Patnaik et al. [2009] used motif mining to model and optimize the effectiveness of the chillers in the data centers. The work in Gong et al. [2009] proposed a signature-driven approach for load balancing in the cloud environment with the help of utilization data. The second direction aims to facilitate system management by improving the management efficiency for human beings. For this direction, Li et al. [2005, 2010a] leveraged text mining techniques to discover the latent categories of events and then leveraged visualization techniques to vividly illustrate the mined results. The studies in Jiang et al. [2011a, 2011b, 2012, 2014] used temporal mining and encoding theory to discover the event interaction behaviors from system logs and then summarized them with more concise representations.

These studies all focus on low-level autonomous computing that improves the interactions between the machines and the system administrators. In contrast, CSM focuses on high-level autonomous computing that improves the interactions between the customers and the service agents. The challenge is that it is more difficult to satisfy customers than system administrators, as customers usually have less expertise. More importantly, customer satisfaction can directly affect the revenue of the service providers.

7. CONCLUSION AND FUTURE WORK

In this article, we presented the design, implementation, and evaluation of the Cloud Service Marketplace (CSM), an intelligent online marketplace specifically designed for cloud services. CSM is designed as an ecosystem to support complex service matching for service providers and customers. It enables customers to effectively communicate their requirements to the marketplace interface via conversations. By leveraging advanced data mining techniques, CSM is able to gradually understand the requirements of the users and help them quickly find the services they need.

There are several future directions for this system: (1) populating the knowledge graph with more data about the services, (2) making use of the unstructured information of the services by leveraging more advanced text mining and topic modeling techniques, and (3) migrating all the data into a distributed semantic knowledge graph [Huang et al. 2011] and developing more efficient indexing schemes.

ACKNOWLEDGMENTS

The authors would like to thank Rahul Akolkar, Thomas Chefalas, Jim Laredo, Frank Schaffa, Alla Segal, and Tao Tao for participating in the design and development of CSM.

REFERENCES

Rahul Akolkar, Tom Chefalas, Jim Laredo, Chang-Shing Perng, Anca Sailer, Frank Schaffa, Ignacio Silva-Lepe, and Tao Tao. 2012. The future of service marketplaces in the cloud. In IEEE 8th World Congress on Services.

David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1027–1035.

Steve Battle, Abraham Bernstein, Harold Boley, Benjamin Grosof, Michael Gruninger, Richard Hull, Michael Kifer, David Martin, Sheila Mcilraith, Deborah McGuinness, Jianwen Su, and Said Tabet. 2005. Semantic Web Services Language. http://www.w3.org/Submission/SWSF-SWSL.

Christian Bizer and Andreas Schultz. 2009. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems (IJSWIS) 5, 2 (2009), 1–24.

Peter Pin-Shan Chen. 1976. The entity-relationship model: Towards a unified view of data. ACM Transactions on Database Systems 1, 1 (1976), 9–36.




David Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. Active learning with statistical models. Journal of Artificial Intelligence Research 4, 1 (March 1996), 129–145. http://dl.acm.org/citation.cfm?id=1622737.1622744

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Zhenhuan Gong, Prakash Ramaswamy, Xiaohui Gu, and Xiaosong Ma. 2009. Siglm: Signature-driven load management for cloud computing infrastructures. In International Workshop on Quality of Service (IWQoS). IEEE, 1–9.

Jiawei Han. 2005. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Jiewen Huang, Daniel Abadi, and Kun Ren. 2011. Scalable SPARQL query over large RDF graph. In International Conference on Very Large Data Bases (VLDB’11).

IBM Smart Cloud Enterprise Plus. 2009. http://ibmcloud.itosolutions.net.

Makoto Iwayama. 2000. Relevance feedback with a small number of relevance judgments: Incremental relevance feedback vs. document clustering. In Special Interest Group on Information Retrieval (SIGIR’00).

Glen Jeh and Jennifer Widom. 2002. SimRank: A measure of structural-context similarity. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’02). ACM, 538–543.

Yexi Jiang, Chang-Shing Perng, and Tao Li. 2011a. Natural event summarization. In ACM International Conference on Information and Knowledge Management (CIKM’11). ACM, 765–774.

Yexi Jiang, Chang-Shing Perng, Tao Li, and Rong Chang. 2011b. ASAP: A self-adaptive prediction system for instant cloud resource demand provisioning. In International Conference on Data Mining (ICDM’11). IEEE, 1104–1109.

Yexi Jiang, Chang-Shing Perng, Tao Li, and Rong Chang. 2012. Intelligent cloud capacity management. In IEEE Conference on Network Operations and Management Symposium (NOMS’12). IEEE, 502–505.

Yexi Jiang, Chunqiu Zeng, Jian Xu, and Tao Li. 2014. Real time contextual collective anomaly detection over multiple data streams. In KDD Workshop on Outlier Detection and Description Under Data Diversity.

Richard Karp. 1972. Reducibility among Combinatorial Problems. Complexity of Computer Computations. 85–103.

Diane Kelly and Xin Fu. 2006. Elicitation of term relevance feedback: An investigation of term source and context. In Special Interest Group on Information Retrieval (SIGIR’06).

David Kempe, Jon Kleinberg, and Eva Tardos. 2003. Maximizing the spread of influence through a social network. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’03). ACM, 137–146.

Jon Kleinberg. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM) 46, 5 (1999), 604–632.

Alexander Kotov and Chengxiang Zhai. 2011. Interactive sense feedback for difficult queries. In ACM Conference on Knowledge and Information Management (CIKM’11).

Alexander Kotov and Chengxiang Zhai. 2012. Tapping into knowledge base for concept feedback: Leveraging ConceptNet to improve search results for difficult queries. In ACM Conference on Web Search and Data Mining (WSDM’12).

Freddy Lécué and Alain Léger. 2006. A formal model for semantic web service composition. In International Semantic Web Conference (ISWC’06).

Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. 2010b. CloudCmp: Shopping for a cloud made easy. USENIX HotCloud (2010).

Tao Li. 2015. Event Mining: Algorithms and Applications. CRC Press.

Tao Li, Feng Liang, Sheng Ma, and Wei Peng. 2005. An integrated framework on mining logs files for computing system management. In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (SIGKDD’05). ACM, 776–781.

Tao Li, Wei Peng, Charles Perng, Sheng Ma, and Haixun Wang. 2010a. An integrated data-driven framework for computing system management. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 40, 1 (2010), 90–99.

Hongqiang Harry Liu, Ye Wang, Yang Richard Yang, Hao Wang, and Chen Tian. 2012c. Optimizing cost and performance for content multihoming. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 371–382.

Shuang Liu, Fang Liu, Clement Yu, and Yiwei Meng. 2012b. An effective approach to document retrieval via utilizing wordnet and recognizing phrases. In Special Interest Group on Information Retrieval (SIGIR’12).

Xi Liu, Florin Dobrian, Henry Milner, Junchen Jiang, Vyas Sekar, Ion Stoica, and Hui Zhang. 2012a. A case for a coordinated internet video control plane. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. ACM, 359–370.




George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report.

Tan Pang-Ning, Michael Steinbach, Vipin Kumar, and others. 2006. Introduction to Data Mining. Addison-Wesley.

Debprakash Patnaik, Manish Marwah, Ratnesh Sharma, and Naren Ramakrishnan. 2009. Sustainable operation and management of data center chillers using temporal data mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1305–1314.

Wei Peng, Tong Sun, Shriram Revankar, and Tao Li. 2012. Mining “the voice of the customer for business” prioritization. ACM Transactions on Intelligent Systems and Technology 3, 2, Article 38 (Feb. 2012), 17 pages. DOI:http://dx.doi.org/10.1145/2089094.2089114

John Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1 (1986), 81–106.

Dumitru Roman, Uwe Keller, Holger Lausen, Jos de Bruijn, Rubén Lara, Michael Stollberg, Axel Polleres, Cristina Feier, Christoph Bussler, and Dieter Fensel. 2005. Web service modeling ontology. In Applied Ontology.

Abraham Sebastian, Natalya Fridman Noy, Tania Tudorache, and Mark Musen. 2008. A generic ontology for collaborative ontology development workflows. In Knowledge Engineering and Knowledge Management.

Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin-Madison.

Yizhou Sun and Jiawei Han. 2013. Mining heterogeneous information networks: A structural analysis approach. ACM SIGKDD Explorations Newsletter 14, 2 (2013), 20–28.

Bin Tan, Atulya Velivelli, Hui Fang, and Chengxiang Zhai. 2007. Term feedback for information retrieval with language models. In Special Interest Group on Information Retrieval (SIGIR’07).

Hanghang Tong, Christos Faloutsos, and Jia-yu Pan. 2006. Fast random walk with restart and its applications. In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06). 613–622.

Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank Van Harmelen, and Henri Bal. 2012. WebPIE: A web-scale parallel inference engine using MapReduce. Web Semantics: Science, Services and Agents on the World Wide Web 10 (2012), 59–75.

Yang Zhou and Ling Liu. 2011. Clustering social networks with entity and link heterogeneity. Technical Report.

Yang Zhou, Ling Liu, Chang-Shing Perng, Anca Sailer, Ignacio Silva-Lepe, and Zhiyuan Su. 2013. Ranking services by service network structure and service attributes. In IEEE 20th International Conference on Web Services. IEEE, 26–33.

Shenghuo Zhu, Tao Li, Zhiyuan Chen, Dingding Wang, and Yihong Gong. 2008. Dynamic active probing of helpdesk databases. Proceedings of Very Large Data Bases Endowment 1, 1 (Aug. 2008), 13. DOI:http://dx.doi.org/10.1145/1453856.1453937

Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Ozsu, and Dongyan Zhao. 2011. gStore: Answering SPARQL queries via subgraph matching. Proceedings of the Very Large Data Bases Endowment 4, 8 (2011), 482–493.

Received January 2015; revised July 2015; accepted February 2016


