

    Data Services

Michael J. Carey
University of California, Irvine
2091 Donald Bren Hall (DBH)
Irvine, CA
[email protected]

Nicola Onose
University of California, Irvine
2091 Donald Bren Hall (DBH)
Irvine, CA
[email protected]

Michalis Petropoulos
University of California, San Diego
9500 Gilman Drive
San Diego, CA
[email protected]

    ABSTRACT

Data services provide access to data drawn from one or more underlying information sources. Compared to traditional services, data services are model-based, providing richer and more data-centric views of the data they serve; they often provide both query-based and function-based access to data. In the enterprise world, data services play an important role in SOA architectures. They are also important when an enterprise wishes to controllably share data with its business partners via the Internet. In the emerging world of software-as-a-service (SaaS), data services enable enterprises to access their own externally stored information. Last but not least, with the emergence of cloud data management, a new class of data services is appearing. This paper aims to help readers make sense of these technology trends by reviewing data service concepts and examining approaches to service-enabling individual data sources, creating integrated data services from multiple sources, and providing access to data in the cloud.

    1. INTRODUCTION

Today's business practices require access to enterprise data by both external as well as internal applications. Suppliers expose part of their inventory to retailers, and health providers allow patients to view their health records. But how do enterprise data owners make sure that access to their data is appropriately restricted and has predictable impact on their infrastructure? In turn, application developers sitting on the wrong side of the Web from their data need a mechanism to find out which data they can access, what their semantics are, and how they can integrate data from multiple enterprises. Data services are software components that address these issues by providing rich metadata, expressive languages, and APIs for service consumers to use to send queries and receive data from service providers.

Data services are a specialization of Web services that can be deployed on top of data stores, other services, and/or applications to encapsulate a wide range of data-centric operations. In contrast to traditional Web services, services that provide access to data need to be model-driven, offering a semantically richer view of their underlying data and enabling advanced querying functionality. Data service consumers need access to enhanced metadata, which is both machine-readable and human-readable, such as schema information. This metadata is needed to use or integrate entities returned by different data services. For example, a getCustomers data service might retrieve customer entities, while getOrdersByCID retrieves orders given a customer id. In the absence of any schema information, the consumer cannot be sure whether these two services can be composed to retrieve the orders of customers. Moreover, consumers of data services can utilize a query language and pass a query expression as a parameter to a data service; they can use metadata-guided knowledge of data services and their relationships to formulate queries and navigate between sets of entities.

Modern data services are descendants of the stored procedure facilities provided by relational database systems, which allow a set of SQL statements and control logic to be parameterized, named, access-controlled, and then called from applications that wish to have the procedure perform its task(s) without their having to know the internal details of the procedure [42]. The use of data services has many of the same access control and encapsulation motivations. Data owners (database administrators) publish data services because they are able to address security concerns and protect sensitive data, predict the impact of the published data services on their resources, and safeguard and optimize the performance of their servers. In traditional IT settings, it is not uncommon for stored procedures to be the only permitted data access method for these reasons, as letting applications submit arbitrary queries has long been dismissed as too permissive and unpredictable [12]. This is even more of an issue in the world of data services, where providers of data services typically have less knowledge of (and much less control over) their client applications.

We are now moving towards a hosted services world and new ways of managing and exchanging data. The growing importance of data services in this movement is evidenced by the number of contexts within which they have been utilized in recent years: data publishing, data exchange and integration, service-oriented architectures (SOA), data as a service (DaaS), and most recently, cloud computing. Lack of standards, though, has allowed different middleware vendors to develop different models for representing and querying their data services [14]. Consequently, federating databases and integrating data in general may become an even bigger headache. This picture becomes still more complicated if we consider databases living in the cloud [3]. Integrating data across the cloud will be all the more daunting because the models and interfaces there are even more diverse, and also because common assumptions about data locality, consistency, and availability may no longer hold.

The purpose of this article is threefold: i) to explain basic data services architecture(s) and terminology (Section 2), ii) to present three popular contexts in which data services are being deployed today, along with one exemplary commercial ...


    !"#$%&'#()*+,-".$)

    /%+0'+$)

    12,2)3+04'5+)

    !"#"$%$&'#"("#"$

    )'*+',#,$ !"#"$%$&'#"("#"$-.$

    /&012#345+61789:$

    6%#57"#$)

    89,+0#2:)*".+:)

    !"#"$8#3;'


...relational database that stores information about customers, the orders they have placed, and the items comprising these orders. The owners map this internal relational schema to an external model that includes entity types Customer, Order and Item. Foreign key constraints are mapped to navigation properties of these entity types, as shown by the arrows in the external model of Figure 2. Hence, given an Order entity, its Items and its Customer can be accessed.

WCF Data Services implements the Open Data Protocol (OData) [37], which is an effort to standardize the creation of data services. OData uses the Entity Data Model (EDM) as the external model, which is an object-extended E/R model with strong typing, identity, inheritance and support for scalar and complex properties. EDM is part of Microsoft's Entity Framework, an object-relational mapping (ORM) framework [2], which provides a declarative mapping language and utilities to generate EDM models and mappings given a relational database schema.

Clients consuming OData services, such as those produced by the WCF Data Services framework, can retrieve a Service Metadata Document [37] describing the underlying EDM model by issuing the following request, which will return the entity types and their relationships:

http://<host>/SalesDS.svc/$metadata

OData services provide a uniform, URI-based querying interface and map CRUD (Create, Retrieve, Update, and Delete) operations to standard HTTP verbs [22]. Specifically, to create, read, update or delete an entity, the client needs to submit an HTTP POST, GET, PUT or DELETE request, respectively. For example, the following HTTP GET request will retrieve the open Orders and Items for the Customer with cid equal to 'C43':

http://<host>/SalesDS.svc
  /Customers('C43')/Orders
  ?$filter=status eq 'open'
  &$expand=Items

The second line of the above request selects the Customer entity with key 'C43' and navigates to its Order entities. The $filter construct on the third line selects only the open Order entities, and the $expand on the last line navigates and inlines the related Item entities within each Order. Other query constructs, such as $select, $orderby, $top and $skip, are also supported as part of a request.
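To make the request above concrete, here is a minimal sketch of issuing it from Python with the requests library; the service root is a hypothetical placeholder and this sketch is not part of the original article:

import requests

SERVICE_ROOT = "http://example.com/SalesDS.svc"  # hypothetical service root

# GET maps to the OData read operation; POST/PUT/DELETE would map to
# create/update/delete, as described above.
resp = requests.get(
    SERVICE_ROOT + "/Customers('C43')/Orders",
    params={"$filter": "status eq 'open'", "$expand": "Items"},
    headers={"Accept": "application/atom+xml"},  # AtomPub response format
)
resp.raise_for_status()
print(resp.text)  # the open Orders, each with its Items inlined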

To provide interested readers with a bit more detail, a graphical representation of the external model (EDM) together with examples of the OData data service metadata and an AtomPub response from an OData service call are available online in Appendix A.

Driven by the mapping between the EDM model and the underlying data store's relational model, the WCF Data Services framework automatically translates the above request into the query language of the store. In our example then, under the covers, the above request would be translated into the following SQL query involving a left outer join:

SELECT Order.*, Item.*
FROM Customer JOIN Order LEFT OUTER JOIN Item
WHERE Customer.cid='C43' AND Order.status='open'

The WCF Data Services framework can also publish stored procedures as part of an OData service, called Function Imports, which can be used for any of the CRUD operations and are integrated in the same URI-based query interface.

    !"#$%&$'!()('*+,%-$&'

    !"#$%"&'()"$*+,-%$

    ./)$%0(1'

    2+3$1'

    +./.$%"&'()"$

    +./.$

    %"&'()"$

    01234$

    +./.$

    !"#$%'() *"+,-'.(+)

    40)$5%(6+0'

    789&"-(1'

    !()('

    *$%#"-$'

    789&"-(1'

    !()('

    *$%#"-$'

    56/"7&.826$

    93"&:$

    2+3$1'

    2(::"05'

    2+3$1'

    2(::"05'

    )(4$ ;/./"$

    ;


...create, update, and delete functions; the read function returns a stream of XML elements, one per row from the underlying table, if used in a query without additional filter predicates. A Web service-based data source is modeled as a function whose inputs and outputs correspond to the schema information found in the service's WSDL description.

For a given set of physical data services, the task of the data service architect is then to construct an integrated model for consumption by client application developers, SOA applications, and/or end users. In the example shown in Figure 3, data from two different sources, a Web service and a relational database, are being combined to yield a single customer data service that hides its integration details from its consumers. In ODSI, the required merging and mapping of lower-level data services to create higher-level services is achieved by function composition using the XQuery language [9]. (This is very much like defining and layering views in a relational DBMS setting.) Because data service architects often prefer graphical tools to writing queries by hand, ODSI provides a graphical query editor to support this activity; Appendix B describes its usage for this use case in a bit more detail for the interested reader.

Stepping back, Figure 3 depicts a declarative approach to building data services that integrate and service-enable disparate data sources. The external model is a functional model based on XQuery functions. The approach is declarative because the integration logic is specified in a high-level language; in the case of ODSI, the integration query is written in XQuery. Given this approach, suppose the resulting function is subsequently called from a query such as the following, which could come either from an application or from another data service defined on top of this one:

for $cust in ics:getAllCustomers()
where $cust/State = "Rhode Island"
return $cust/Name

In this case, the data services platform can see through the function definition and optimize the query's execution by fetching only Rhode Island customers from the relational data source and retrieving only the orders for those customers from the order management service to compute the answer. This would not be possible if the integration logic was instead encoded in a procedural language like Java or a business process language like BPEL [32]. Moreover, notice that the query doesn't request all data for customers; instead, it only asks for their names. Because of this, another optimization is possible: the engine can answer the query by fetching only the names of the Rhode Island customers from the relational source and altogether avoid any order management system calls. Again, this optimization is possible because the data integration logic has been declaratively specified. Finally, it is important to note that such function definitions and query optimizations can be composed and later decomposed (respectively) through arbitrary layers of specifications. This is attractive because it makes an incremental (piecewise) data integration methodology possible, as well as allowing for the creation of specialized data services for different consumers, without implying runtime penalties due to such layering when actually answering queries.

In addition to exemplifying the integration side of data services, ODSI includes some interesting advanced data service modeling features. ODSI supports the creation and publishing of collections of interrelated data services (a.k.a. dataspaces) [13]. Metadata about collections of published data services is made available in ODSI through catalog data services. Also, in addition to supporting method calls from clients, ODSI provides an optional generic query interface that authorized clients can use to submit ad hoc queries that range over the results of calls to one or more data service methods. Thus, ODSI offers a hybrid of the two data service consuming methods discussed in Section 2.

Figure 4: Data Services and Cloud Storage

To aid application developers, methods in ODSI are classified by purpose. ODSI supports read, create, update, and delete methods for operating on data instances. It also supports relationship methods to navigate from one object instance (e.g., a customer) to related object instances (e.g., complaints). ODSI methods are characterized as being either functions (side-effect free) or procedures (potentially side-effecting). Finally, a given method's visibility can be designated as being one of: accessible by outside applications, useable only from within other data services, or private to one particular data service.

    5. CLOUD DATA SERVICES

Previous sections described how an enterprise data source or an integrated set of data sources can be made available as services. In this section we focus on a new class of data services designed for providing data management in the cloud.

The cloud is quickly becoming a new universal platform for data storage and management. By storing data in the cloud, application developers can enjoy a pay-as-you-go charging model and delegate most administrative tasks to the cloud infrastructure, which in turn guarantees availability and near-infinite scalability. As depicted in Figure 4, cloud data services today offer various external data models and consuming methods, ranging from key-value stores to sparse tables and all the way to RDBMSs in the cloud. In terms of consuming methods (see Figure 1), key-value stores offer a simple function-based interface, while sparse tables are accessed via queries. RDBMSs allow both a functional interface, if data access is provided via stored procedures only, and a query-based interface, if the database may be queried directly. A recent and detailed survey on cloud storage technologies [16] proposes a similar classification of cloud data services, but further differentiates sparse tables into document stores and extensible record stores. With respect to the architecture in Figure 1, some of these services forgo the model mapping layer, choosing instead to directly expose their underlying model to consuming applications.

To illustrate the various types of cloud data services, we will briefly examine the Amazon Web Services (AWS) [1]


platform, as it has arguably pioneered the cloud data management effort. Other IT companies are also switching to building cloud data management frameworks, either for internal applications (e.g., Yahoo!'s PNUTS [20]) or to offer them as publicly available data services (e.g., Microsoft's WCF data services, as made available in Windows Azure [34]).

Key-Value Stores: The simplest kind of data storage service is the key-value store, which offers atomic CRUD operations for manipulating serialized data structures (objects, files, etc.) that are identifiable by a key.

An example of a key-value store is Amazon S3 [1]. S3 provides storage support for variable-size data blocks, called objects, uniquely identified by (developer-assigned) keys. Data blocks reside in buckets, which can list their content and are also the unit of access control. Buckets are treated as subdomains of s3.amazonaws.com. (For instance, the object customer01.dat in the bucket custorder can be accessed as http://custorder.s3.amazonaws.com/customer01.dat.)

The most common operations in S3 are:

  • create (and name) a bucket,
  • write an object, by specifying its key, and optionally an access control list for that object,
  • read an object,
  • delete an object,
  • list the keys contained in one of the buckets.
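To make these operations concrete, here is a minimal sketch using the boto3 Python SDK (not part of the original article); the bucket and key names are hypothetical, and AWS credentials are assumed to be configured in the environment:

import boto3

s3 = boto3.client("s3")  # credentials and region assumed to be configured externally

# create (and name) a bucket
s3.create_bucket(Bucket="custorder")

# write an object by key, optionally with an access control list
s3.put_object(Bucket="custorder", Key="customer01.dat",
              Body=b"...serialized customer data...", ACL="private")

# read an object
body = s3.get_object(Bucket="custorder", Key="customer01.dat")["Body"].read()

# list the keys contained in the bucket
keys = [o["Key"] for o in s3.list_objects_v2(Bucket="custorder").get("Contents", [])]

# delete an object
s3.delete_object(Bucket="custorder", Key="customer01.dat")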

Data blocks in S3 were designed to hold large objects (e.g., multimedia files), but they can potentially be used as database pages storing (several) records. Brantner et al. [11] started from this observation and analyzed the trade-offs in building a database system over S3. However, recent industrial trends are favoring the incorporation of more DBMS functionality directly into the cloud infrastructure, thus offering a higher-level interface to the application.

Dynamo [21] is another well-known Amazon example of a key-value store. It differs in granularity from S3 since it stores only objects of a relatively small size (< 1 MB), whereas data blocks in S3 may go up to 5 GB.

Sparse Tables: Sparse tables are a new paradigm of storage management for structured and semi-structured data that has emerged in recent years, especially after the interest generated by Google's Bigtable [18]. (Bigtable is the storage system behind many of Google's applications and is exposed, via APIs, to Google App Engine [25] developers.) A sparse table is a collection of data records, each one having a row identifier and a set of column identifiers, so that at the logical level records behave like the rows of a table. There may be little or no overlap between the columns used in different rows, hence the sparsity. The ability to address data based on (potentially many) columns differentiates sparse tables from key/value stores and makes it possible to index and query data more meaningfully. Compared to traditional RDBMSs that require static (fixed) schemas, sparse tables have a more flexible data model, since the set of column identifiers may change based on data updates. Bigtable has inspired the creation of similar open source systems such as HBase [29] and Cassandra [15].
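As a small illustration (not from the article), one can picture a sparse table as a mapping from each row identifier to that row's own set of column/value pairs, with different rows using largely disjoint columns:

rows = {
    "C043": {"state": "NY", "name": "Alice", "loyalty_tier": "gold"},
    "O012": {"cid": "C043", "status": "open", "gift_wrap": "yes"},  # entirely different columns
}
# Data is addressed by row identifier plus column identifier; a column that is
# absent from a row simply does not exist for that row (hence "sparse").
print(rows["C043"].get("state"))       # 'NY'
print(rows["C043"].get("gift_wrap"))   # None: this row never stored that column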

SimpleDB [1] is Amazon's version of a sparse table and exposes a Web service interface for basic indexing and querying in the cloud. A column value in a SimpleDB table may be atomic, as in the relational model, or a list of atomic values (limited in size to 1 KB). SimpleDB's tables are called domains. SimpleDB queries have a SQL-like syntax and can perform selections, projections and sorting over domains. There is no support for joins or nested subqueries.

A SimpleDB application stores its customer information in a domain called Customers and its order information in an Orders domain. Using SimpleDB's REST (Representational State Transfer [22]) interface, the application can insert records (id='C043', state='NY') into Customers and (id='O012', cid='C043', status='open') into Orders. Further inserts do not necessarily need to conform to these schemas, but for the sake of our example we will assume that they do.

Since SimpleDB does not implement joins, joins must be coded at the client application level. For example, to retrieve the orders for all NY clients, an application would first fetch the client info via the query:

select id from Customers where state = 'NY'

the result of which would include 'C043'; the application would then retrieve the corresponding orders as follows:

select * from Orders where cid = 'C043'
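The client-side join just described can be sketched as follows (not from the article); sdb_select is a hypothetical helper standing in for whatever SimpleDB client call executes a select expression and returns rows as dictionaries:

def orders_for_ny_customers(sdb_select):
    # Step 1: fetch the ids of all NY customers.
    customer_ids = [row["id"] for row in
                    sdb_select("select id from Customers where state = 'NY'")]
    # Step 2: fetch the orders for each customer id, one query per id,
    # and merge the results on the client, since SimpleDB cannot join.
    orders = []
    for cid in customer_ids:
        orders.extend(sdb_select("select * from Orders where cid = '%s'" % cid))
    return orders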

A major limitation for SimpleDB is that the size of a table instance is bounded. An application that manipulates a large volume of data needs to manually partition (shard) it and issue separate queries against each of the partitions.

RDBMSs: In cloud computing systems that provide a virtual machine interface, such as EC2 [1], users can install an entire database system in the cloud. However, there is also a push towards providing a database management system itself as a service. In that case, administrative tasks such as installing and updating DBMS software and performing backups are delegated to the cloud service provider.

Amazon RDS [1] is a cloud data service that provides access to the full capabilities of a MySQL [39] database installed on a machine in the cloud, with the possibility of setting up several read replicas for read-intensive workloads. Users can create new databases from scratch or migrate their existing MySQL data into the Amazon cloud. Microsoft has a similar offering with SQL Azure [7], but chooses a different strategy that supports scaling by physically partitioning and replicating logical database instances on several machines. A SQL Azure source can be service-enabled by publishing an OData service on top of it, as in Section 3. Google's Megastore [5] is also designed to provide scalable and reliable storage for cloud applications, while allowing users to model their data in a SQL-like schema language. Data types can be strings, numeric types, or Protocol Buffers [26], and they can be required, optional or repeated.

Amazon RDS users manage and interact with their databases either via shell scripts or a SOAP (Simple Object Access Protocol [36]) based Web services API. In both cases, in order to connect to a MySQL instance, users need to know its DNS name, which is a subdomain of rds.amazonaws.com. They can then either open a MySQL console using an Amazon-provided shell script, or they can access the database like any MySQL instance identified by a DNS name and port.
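As a concrete illustration (a sketch, not from the article; the instance DNS name, credentials, schema, and the PyMySQL client library are all assumptions), connecting to an RDS-hosted MySQL instance looks like connecting to any MySQL server identified by a DNS name and port:

import pymysql  # any MySQL client library would do; PyMySQL is an assumption

conn = pymysql.connect(
    host="mydb.abc123.us-east-1.rds.amazonaws.com",  # hypothetical RDS DNS name
    port=3306,
    user="admin",
    password="...",
    database="sales",
)
with conn.cursor() as cur:
    cur.execute("SELECT cid, status FROM Orders WHERE status = 'open'")
    print(cur.fetchall())
conn.close()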

    6. ADVANCED TECHNICAL ISSUES



So far we have mostly covered the basics of data services, touching on a range of use cases (single source, integrated source, and cloud sources) along with their associated data service technologies. In this section we will briefly highlight a few more advanced topics and issues, including updates and transactions, data consistency for scalable services, and issues related to security for data services.

Data Service Updates and Transactions: As with other applications, applications built over data services require transactional properties in order to operate correctly in the presence of concurrent operations, exceptions, and service failures. Data services based on single sources, for the most part, can inherit their answer to this requirement from the source that they serve their data from. Data services that integrate data from multiple sources, however, face additional challenges, especially since many interesting data sources, such as enterprise Web services and cloud data services, are either unable or unwilling to participate in traditional (two-phase commit-based) distributed transactions due to issues related to high latencies and/or temporary loss of autonomy. Data service update operations that involve non-transactional sources can potentially be supported using a compensation-based transaction model [8] based on Sagas [23]. The classic compensating transaction example is travel-related, where a booking transaction might need to perform updates against multiple autonomous ticketing services (to obtain airline, hotel, rental car, and concert reservations) and roll them all back via compensation in the event that reservations cannot be obtained from all of them. Unfortunately, such support is under-developed in current data service offerings, so this is an area where all current systems fall short and further refinement is required. The current state of the art leaves too much to the application developer in terms of hand-coding compensation logic as well as picking up the pieces after non-atomic failures.
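To illustrate the compensation idea (this sketch is not from the article, and the booking actions are hypothetical), a saga-style coordinator runs each step and, on failure, undoes the already-completed steps in reverse order:

class SagaFailed(Exception):
    pass

def run_saga(steps):
    """Run (action, compensation) pairs; on failure, compensate completed steps in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as exc:
        for compensation in reversed(completed):
            compensation()  # best-effort undo of an already-committed step
        raise SagaFailed("booking rolled back via compensation") from exc

# Hypothetical booking example in the spirit of the travel scenario above.
run_saga([
    (lambda: print("book flight"),  lambda: print("cancel flight")),
    (lambda: print("book hotel"),   lambda: print("cancel hotel")),
    (lambda: print("book concert"), lambda: print("cancel concert")),
])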

Another challenge, faced both by single-source and multi-source data services, is the mapping of updates made to the external model to correspondingly required updates to the underlying data source(s). This challenge arises because data services that involve non-trivial mappings, as might be built using the tools provided by WCF or ODSI, present the service consumer with an interface that differs from those of the underlying sources. The data service may restructure the format of the data, it may restrict what parts of the data are visible to users, and/or it may integrate data coming from several back-end data sources. Propagating data service updates to the appropriate source(s) can be handled for some of the common cases by analyzing the lineage of the published data, i.e., computing the inverse mapping from the service view back to the underlying data sources based on the service view definition [2, 8]. In some cases this is not possible, either due to issues similar to the non-updatability of relational views [6, 33] or due to the presence of opaque functional data sources such as Web service calls, in which case hints or manual coding would be required for a data services platform to know how to back-map any relevant data changes.

Data Consistency in the Cloud: To provide scalability and availability guarantees when running over large clusters, cloud data services have generally adopted weaker consistency models. This choice is defended in [30], which states that in large-scale distributed applications the scope of an atomic update needs to be, and generally is, within an abstraction called an entity. An entity is a collection of data with a unique key that lives on one machine, such as a data record in Dynamo. According to [30], developers of truly scalable applications have no real choice but to cope with the lack of transactional guarantees across machines and with repeated messages sent between entities. In practice, there are several consistency models that share this philosophy.

The simplest model is eventual consistency [46], first defined in [45] and used in Amazon Dynamo [21], which only guarantees that all updates will reach all replicas eventually. There are some challenges with this approach, mainly because Dynamo uses replication to increase availability. In the case of network or server failures, concurrent updates may lead to update conflicts. To allow for more flexibility, Dynamo pushes conflict resolution to the application. Other systems try to simplify developers' lives and resolve conflicts inside the data store, based on simple policies: e.g., in S3 the write with the latest timestamp wins.
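The last-write-wins policy just mentioned is simple enough to sketch in a few lines (not from the article): each replica's value carries the timestamp of the write that produced it, and the newest write wins.

def resolve_conflict(replica_versions):
    """Each version is a (timestamp, value) pair; the write with the latest timestamp wins."""
    return max(replica_versions, key=lambda tv: tv[0])[1]

print(resolve_conflict([(10, "NY"), (12, "CA"), (11, "TX")]))  # -> 'CA'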

PNUTS, Yahoo!'s sparse table store, goes one step beyond eventual consistency by also guaranteeing that all replicas of a given record apply all updates in the same order. Such stronger guarantees, called timeline consistency, may be necessary for applications that manipulate user data, e.g., if a user changes access control rights for a published data collection and then adds sensitive information. Amazon SimpleDB has recently added support for similar guarantees, which they call consistent reads, as well as for conditional updates, which are executed only if the current value of an attribute of a record has a specified expected value. Conditional update primitives make it easier to implement solutions for common use cases such as global counters or optimistic concurrency control (by maintaining a timestamp attribute).
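As an illustration of that last point (a sketch, not from the article; get and conditional_put are hypothetical stand-ins for a store's read and conditional-update calls), optimistic concurrency control can be layered on a version attribute:

def optimistic_update(get, conditional_put, key, change):
    """Retry a read-modify-write until the conditional update succeeds.

    get(key) -> dict of attributes including a 'version' counter;
    conditional_put(key, attrs, expected_version) -> True only if the stored
    'version' still equals expected_version (the store's conditional update).
    """
    while True:
        record = get(key)
        expected = record["version"]
        updated = change(dict(record))
        updated["version"] = expected + 1
        if conditional_put(key, updated, expected):
            return updated
        # Another writer won the race; re-read and try again.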

Finally, RDBMSs in the cloud (Megastore, SQL Azure) provide ACID semantics under the restriction that a transaction may touch only one entity. This is ensured by requiring all tables involved in a transaction to share the same partitioning key. In addition, Megastore provides support for transactional messaging between entities via queues and for explicit two-phase commit.

Data Services Security: A key aspect of data services that is underdeveloped in current product and service offerings, yet extremely important, is data security. Web service security alone is not sufficient, as control over who can invoke which service calls is just one aspect of the problem for data services. Given a collection of data services, and the data over which they are built, a data service architect needs to be able to define access control policies that govern which users can do and/or see what and from which data services. As an example, an interesting aspect of ODSI is its support for fine-grained control over the information disclosed by data service calls, where the same service call, depending on who's asking, can return more or less information to the caller. Portions of the information returned by a data service call can be encrypted, substituted, or altogether elided (schema permitting) from the call's results [10]. More broadly, much work has been done in the areas of access control, security, and privacy for databases, and much of it applies to data services. These topics are simply too large [4] to cover in the scope of this article.


    7. EMERGING TRENDS

In this article we have taken a broad look at work in the area of data services. We looked first at the enterprise, where we saw how data services can provide a data-oriented encapsulation of data as services in enterprise IT settings. We examined concepts, issues, and example products related to service-enabling single data sources as well as related to the creation of services that provide an integrated, service-oriented view of data drawn from multiple enterprise data sources. Given that clouds are rapidly forming on the IT horizon, both for Web companies and for traditional enterprises, we also looked at the emerging classes of data services that are being offered for data management in the cloud. As the latter mature, we expect to see a convergence of everything that we have looked at, as it seems likely that rich data services of the future will often be fronting data residing in one or more data sources in the cloud.

To wrap up this article, we briefly list a handful of emerging trends that can possibly direct future data services research and development. Some of the trends listed stem from already existing problems, while others are more predictive in nature. We chose this list, which is necessarily incomplete, based on the evolution of data services that we have witnessed while slowly authoring this report over the last two years. Again, while data services were initially conceived to solve problems in the enterprise world, the cloud is now making data services accessible to a much broader range of consumers; new issues will surely arise as a result.

Query Formulation Tools: Service-enabled data sources sometimes support (or permit) only a restricted set of queries against their schemas. Users trying to formulate a query over multiple such sources can have a hard time figuring out how to compose such data services. For a schema-based external model, recent work proposed tools to help users author only answerable queries, e.g., CLIDE [41]. These tools utilize the schemas and restrictions of the service-enabled data sources as a basis for query formulation, rather than just the externally visible data service metadata, to guide the users towards formulating answerable queries. More work is needed here to handle broader classes of queries.

Data Service Query Optimization: In the case of integrated data services with a functional external model, one could imagine defining a set of semantic equivalence rules that would allow a query processor to substitute a data service call used in a query for another service call in order to optimize the query execution time, thus enabling semantic data service optimization. For example, the following equivalence rule captures the semantic equivalence of the data services getOrderHistory and getOpenOrders when a [status = 'open'] condition is applied to the former:

getOrderHistory(cid)[status = 'open'] ≡ getOpenOrders(cid)

Work is needed here to help data service architects to specify such rules and their associated tradeoffs easily and to teach query optimizers to exploit them.

Very Large Functional Models: For data services using a functional external model, if the number of functions is very large, it is hard or even impossible for the data owner to explicitly enumerate all functions and for the query developer to have a global picture of them. Consider the example of a data owner who, for performance reasons, only wants to allow queries that use some non-empty subset of a set of n filter predicates. Enumerating all the 2^n - 1 combinations as functions would be tedious and impractical for large n. Recent work has studied how models consisting of such large collections of functions, where the function bodies are defined by XPath queries, can be compactly specified using a grammar-like formalism [40] and how queries over the output schema of such a service can be answered using the model [17]. More work is needed here to extend the formalism and the query answering algorithms to larger classes of queries and to support functions that perform updates.

Cloud Data Service Integration: Since consumers and small businesses are starting to store their data in the cloud, it makes sense to think about data service integration in the cloud when there are many, many small data services available. For example, how can Clark Cloudlover integrate his Google calendar with his wife's Apple iCal calendar? At this point, there is no homogeneity in data representation and querying, and the number of cloud service providers is rapidly increasing. Google Fusion Tables [27] is one example of a system that follows this trend and allows its users to upload tabular data sets (spreadsheets), to store them in the cloud, and subsequently to integrate and query them. Users are able to integrate their data with other publicly available datasets by performing left outer joins on primary keys, called table merges. Fusion Tables also visualizes users' data using maps, graphs and other techniques. Work is needed on many aspects of cloud data sharing and integration.

Data Summaries: As the number of data services increases to a consumer scale, it will be hard even to find the data services of interest and to differentiate among data services whose output schemas are similar. One approach to easing this problem is to offer data summaries that can be searched and that can give data service consumers an idea of what lies behind a given data service. Data sampling and summarization techniques that have been traditionally employed for query optimization can serve as a basis for work on large-scale data service characterization and discovery.

Cloud Data Service Security: Storing proprietary or confidential data in the cloud obviously creates new security problems. Currently, there are two broad choices. Data owners can either encrypt their data, which means that all but exact-match queries have to be processed on the client, moving large volumes of data across the cloud, or they can trust cloud providers with their data, hoping that there are enough security mechanisms in the cloud to guard against malicious applications and services that might try to access data that does not belong to them. There is early ongoing work [24] that may help to bridge this gap by enabling queries and updates over encrypted data, but much more work is needed to see if practical (e.g., efficient) approaches and techniques can indeed be developed.

    8. ACKNOWLEDGMENTS

We would like to thank Divyakant Agrawal (UC Santa Barbara), Pablo Castro (Microsoft), Alon Halevy (Google), James Hamilton (Amazon) and Joshua Spiegel (Oracle) for their detailed comments and the time they devoted to carefully reading an earlier version of this article. We would also like to thank the handling editor and the anonymous reviewers for


feedback that improved the quality of this article. This work was supported in part by NSF IIS awards 0910989, 0713672, and 1018961.

    9. REFERENCES

    [1] Amazon Web Services, 2010. http://aws.amazon.com/.

[2] A. Adya, J. A. Blakeley, S. Melnik, and S. Muralidhar. Anatomy of the ADO.NET entity framework. In SIGMOD Conference, pages 877–888, 2007.

[3] D. Agrawal, A. E. Abbadi, S. Antony, and S. Das. Data Management Challenges in Cloud Computing Infrastructures. In DNIS, pages 1–10, 2010.

[4] A. A. Friedman and D. M. West. Privacy and Security in Cloud Computing. Issues in Technology Innovation, 3, October 2010. Center for Technology Innovation at Brookings.

[5] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In CIDR Conference, 2011.

[6] F. Bancilhon and N. Spyratos. Update semantics of relational views. ACM Trans. Database Syst., 6:557–575, December 1981.

[7] P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manne, L. Novik, and T. Talius. Adapting Microsoft SQL Server for cloud computing. In ICDE, pages 1255–1263, 2011.

[8] M. Blow, V. Borkar, M. Carey, C. Hillery, A. Kotopoulis, D. Lychagin, R. Preotiuc-Pietro, P. Reveliotis, J. Spiegel, and T. Westmann. Updates in the AquaLogic Data Services Platform. In IEEE International Conference on Data Engineering (ICDE), pages 1431–1442, 2009.

[9] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, and J. Simeon. XQuery 1.0: An XML Query Language. W3C Recommendation, 23 January 2007. http://www.w3.org/TR/xquery/.

[10] V. R. Borkar, M. J. Carey, D. Engovatov, D. Lychagin, P. Reveliotis, J. Spiegel, S. Thatte, and T. Westmann. Access Control in the AquaLogic Data Services Platform. In SIGMOD Conference, pages 939–946, 2009.

[11] M. Brantner, D. Florescu, D. A. Graf, D. Kossmann, and T. Kraska. Building a database on S3. In SIGMOD Conference, pages 251–264, 2008.

[12] Britton-Lee Inc. IDM 500 Software Reference Manual Version 1.3, 1981.

[13] M. J. Carey. Data Delivery in a Service-Oriented World: The BEA AquaLogic Data Services Platform. In SIGMOD Conference, pages 695–705, 2006.

[14] M. J. Carey. SOA What? IEEE Computer, 41(3):92–94, 2008.

[15] The Apache Cassandra Project, 2009. http://cassandra.apache.org/.

[16] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Record, 39(4):12–27, 2010.

[17] B. Cautis, A. Deutsch, N. Onose, and V. Vassalos. Efficient rewriting of XPath queries using query set specifications. PVLDB, 2(1):301–312, 2009.

[18] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI '06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, pages 15–15, 2006.

[19] E. Christensen, F. Curbera, G. Meredith, and S. Weerawarana. Web Services Description Language (WSDL) 1.1. W3C Note, 15 March 2001. http://www.w3.org/TR/wsdl.

[20] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277–1288, 2008.

[21] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, pages 205–220, 2007.

[22] R. T. Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine, 2000.

[23] H. Garcia-Molina and K. Salem. Sagas. SIGMOD Rec., 16(3):249–259, 1987.

[24] C. Gentry. Computing arbitrary functions of encrypted data. Commun. ACM, 53(3):97–105, 2010.

[25] Google. Google App Engine. http://code.google.com/appengine.

[26] Google. Protocol Buffers: Google's data interchange format, 2008. http://code.google.com/p/protobuf.

[27] Google. Fusion Tables, 2009. http://tables.googlelabs.com.

[28] J. Gregorio and B. de hOra. The Atom Publishing Protocol, 2005. http://datatracker.ietf.org/doc/rfc5023/.

[29] HBase, 2010. http://hbase.apache.org/.

[30] P. Helland. Life Beyond Distributed Transactions, an Apostate's Opinion. In Conference on Innovative Data Systems Research (CIDR), 2007.

[31] JavaScript Object Notation. http://www.json.org/.

[32] D. Jordan and J. Evdemon. Web Services Business Process Execution Language Version 2.0. OASIS Standard, 11 April 2007. http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html.

[33] Y. Kotidis, D. Srivastava, and Y. Velegrakis. Updates through views: A new hope. In ICDE, page 2, 2006.

[34] Microsoft, Inc. Windows Azure, 2009. http://www.microsoft.com/windowsazure/windowsazure/.

[35] Microsoft, Inc. WCF Data Services, 2011. http://msdn.microsoft.com/en-us/data/bb931106.

[36] N. Mitra and Y. Lafon. SOAP Version 1.2 Part 0: Primer (Second Edition). W3C Recommendation, 27 April 2007. http://www.w3.org/TR/soap/.

[37] OData. Open Data Protocol, 2010. http://www.odata.org/.

[38] Oracle and/or its affiliates. Oracle Data Service Integrator 10gR3 (10.3), 2011. http://download.oracle.com/docs/cd/E13162 01/odsi/docs10gr3.

[39] Oracle Corporation and/or its affiliates. MySQL. http://www.mysql.com/.


    Figure 5: An Entity Data Model

    !""#$%&'#()

    *+,&-&,&)

    .(',/)

    *+,&-&,&)

    ) ) )

    Figure 6: An OData Service Metadata Document



    Figure 7: An OData Service Response Document


    !"#$"%

    &'(')$*$(+%

    ,$-%.$"/01$%

    234+5*$"%

    6$7'85('7%

    9'+'%.$"/01$%

    :(+$)"'+$#%

    9'+'%.$"/01$%

    Figure 8: Integrating RDBMS and Web Service Data Sources