incremental maintenance of aggregate and outerjoin expressions

26

Upload: nguyenkiet

Post on 11-Feb-2017

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Incremental Maintenance of Aggregate and Outerjoin Expressions

Incremental Maintenance of Aggregate and OuterjoinExpressionsHimanshu Gupta �Stanford [email protected] Inderpal Singh MumickSavera [email protected] Number AMERICA82

AbstractViews stored in a data warehouse need to be kept current. As recomputing the views is veryexpensive, incremental maintenance algorithms are required. Over recent years, several incre-mental maintenance algorithms have been proposed. None of the proposed algorithms handlethe case of general relational expressions involving aggregate and/or outerjoin operators. More-over, in most of the maintenance algorithms proposed, changes are propagated as insertions anddeletions.In this article, we develop a framework of change-tables and a refresh operator for incre-mentally maintaining general view expressions involving relational and aggregate operators. Weshow that the presented techniques outperform the previously proposed techniques by ordersof magnitude. The developed framework easily extends to e�ciently maintaining view expres-sions containing outerjoin operators. Moreover, the framework also yields an approach to allowpropagation of certain kinds of deletions and updates directly in a very e�cient manner.1 IntroductionIn a data warehouse, views are computed and stored in the database to allow e�cient queryingand analysis of the data. These views stored at the data warehouse are known as materializedviews. In order to keep the views in the data warehouse up to date, it is necessary to maintain thematerialized views in response to the changes at the sources. The view can be either recomputedfrom scratch, or incrementally maintained by propagating the base data changes onto the view sothat the view re ects the changes. Incrementally maintaining a view can be signi�cantly cheaperthan recomputing the view from scratch, especially if the size of the view is large compared to thesize of the changes [BM90, MQM97, CKL+97].The problem of �nding such changes at the views based on changes to the base relations hascome to be known as the view maintenance problem and has been studied extensively. Severalalgorithms have been proposed over the recent years [BLT86, BCL89, CW91, QW91, GMS93,GL95, Qua97, MQM97, GJM97] for incremental maintenance of view expressions. The previouslyproposed algorithms on incremental maintenance su�er from the following shortcomings:�Contact Author. Gates Computer Science Building, Wing 4B, Stanford CA 94305. Tel: (650) 723-9445, Fax:(650) 725-2588. Supported by NSF grant IRI-96-31952. Most of the work done while at ATT Research Labs, FlorhamPark, NJ. 1

Page 2: Incremental Maintenance of Aggregate and Outerjoin Expressions

� None of the earlier work handles the case of general view expressions involving aggregate andouterjoin operators. Quass in [Qua97] is the only work that attempts to maintain general viewexpressions involving aggregate operators, but the expressions obtained are very ine�cientand complicated. Gupta et al. in [GJM97] show how to maintain a simple outerjoin view,but do not address general expressions involving outerjoin operators.1� To date, most of the incremental maintenance approaches compute and propagate insertionsand deletions at each node in a view expression tree, which could be very ine�cient in casesinvolving aggregations or outerjoins.Paper Organization In the rest of this section, we present some basic notation used throughoutthis article. Section 2 presents a motivating example that illustrates the idea behind this paper andcontrasts previous techniques with techniques presented in this article. In Section 3, we brie ydescribe how our work �ts in the previous frameworks of incremental view maintenance algorithms.In Section 4, we de�ne the refresh operator used to apply the changes represented as a changetable and brie y outline its implementation. In the following section, we discuss propagation ofaggregate-change tables, i.e., change tables that originate at an aggregate operator. In Section 6, wediscuss propagation of change tables that originate at an outerjoin node. We discuss the optimalityof our techniques under some reasonable cost model in Section 7. A brief survey of related work ispresented in Section 8. Finally, we present our concluding remarks in Section 9. The full version[GM99] of the paper contains proofs to the stated theorems and the extensions of the techniquesto propagate certain kinds of deletions and updates directly and more e�ciently.1.1 NotationWe consider only bag semantics in this article, i.e., all the relational operators used are duplicate-preserving. We use ] to denote bag union, �� to denote monus (bag minus), 5E to denotedeletions from a bag-algebra expression E,4E to denote insertions into E, �p to denote selection oncondition p, �A to denote duplicate-preserving projection on a set of attributes A, � to denote thegeneralized projection operator (note that we use slightly di�erent symbols for duplicate-preservingprojection (�) and for generalized projection (�) operators), � to denote cross-product, 1 todenote natural join, and 1J and fo./J to denote join and full outerjoin operations with the joincondition J . The symbols lo./ and ro./ are used for left and right outerjoin respectively. Also,Attrs(J) denotes the set of attributes used in a predicate J or a relation J .The only operators that may require explanation are the outerjoin and generalized projection op-erators. The (full) outerjoin di�ers from an ordinary join by including in the result any \dangling"2tuple of either relation after \padding" it with NULL's in those attributes that belong to the other re-lation. For example, R(A;B) fo./R:B=S:B S(B;C) will include a tuple (a; b; NULL; NULL), if (a; b) 2 Rand (b; c) =2 S for any c. One variant of the outerjoin operator is a left (right) outerjoin, wherethe dangling tuples of only the left (right) operand relation are padded with NULL's and included1In recent work done concurrently with ours, Gri�n and Kumar [GK98] derives expressions for propagatinginsertions and deletions through outerjoin operators.2Dangling tuples are the ones that fail to join with any tuple from the other relation.2

Page 3: Incremental Maintenance of Aggregate and Outerjoin Expressions

in the result. Hence, in the above example, (a; b; NULL; NULL) would be included in R lo./J S, butnot in R ro./J S. The generalized projection operator introduced in [GHQ95] is used to algebraicallyrepresent the groupby operation of SQL. For example, we could use the following expression tode�ne the SISales view (V1) of Example 1 on the next page.V1 = �storeID;itemID;SumSISales=sum(price);NumSISales=count(�)(�date>1=1=95(sales))For a set of attributes G, we use the notation �G to represent the predicateVg2G(LHS:g = RHS:g)in a join condition, where LHS and RHS are the left and right operand relations of the joinoperator. For example, when J is (�G ^ p), the expression R 1JS denotes a join operation withthe join condition (Vg2G(R:g = S:g) ^ p), for a predicate p and a set of attributes G in R and S.2 Motivating Example and Previous ApproachesEXAMPLE 1 Consider the classic example of a warehouse containing information about stores,items, and day-to-day sales. The warehouse stores the information in three base relations viz.stores, items, and sales de�ned by their schemas as follows.stores(storeID, city, state)InfoStores(state, area, population)items(itemID, category, cost)sales(storeID, itemID, date, price)For each store location, the relation stores contains the storeID, the city, and the state in whichthe store is located. The relation InfoStores contains more information about each state. For eachitem, the relation items contains its itemID, its category, and its purchase price (cost). An itemhas a unique cost, but can belong to multiple categories, e.g., a whitening toothpaste could belong todental care, cosmetics, and hygiene categories.3 The relation sales contains detailed informationabout sales transactions. For each item sold, the relation sales contains a tuple storing the storeIDof the selling store, itemID of the item sold, date of sale, and the sale price.Consider the following views SISales, CitySales, CategorySales, SSInfo, and SSFullInfode�ned over the base relations. The view SISales computes for each storeID and itemID the totalprice of items sold after 1/1/95. The view SISales is an intermediate table used to de�ne theviews CitySales and CategorySales. The view CitySales stores, for each city, the total numberand dollar value of sales of all the stores in the city. The view CategorySales stores the totalsale for each category of items. All the above described views consider only those sales that occurafter 1/1/95. The view SSInfo stores the full outerjoin of the base relations sales and stores,retaining stores that have had no sales due to some reasons and also retaining those sales whosecorresponding storeID is missing from the table stores, because, maybe, the table stores hasnot been updated yet. We de�ne another view SSFullInfo as the outerjoin view of SSInfoand InfoStores. The views CitySales, CategorySales, and SSFullInfo are the only views thatare stored (materialized) at the data warehouse and this is represented below by the keyword\MATERIALIZED"4 in the SQL de�nitions of the views. We wish to maintain these materializedviews in response to changes at the base relations.3Note that items is denormalized with a functional dependency from itemID to cost.4The keyword \MATERIALIZED" is not supported by SQL, but has been introduced in this article.3

Page 4: Incremental Maintenance of Aggregate and Outerjoin Expressions

CREATE VIEW SISales ASSELECT storeID, itemID, sum(price) AS SumSISales, count(�) AS NumSISalesFROM salesWHERE date > 1=1=95GROUP BY storeID, itemID;CREATE MATERIALIZED VIEW CitySales ASSELECT city, sum(SumSISales) AS SumCiSales, sum(NumSISales) AS NumCiSalesFROM SISales, storesWHERE SISales.storeID = stores.storeIDGROUP BY city;CREATE MATERIALIZED VIEW CategorySales ASSELECT category, sum(SumSISales) AS SumCaSales, sum(NumSISales) AS NumCaSalesFROM SISales, itemsWHERE SISales.itemID = items.itemIDGROUP BY category;CREATE VIEW SSInfo ASSELECT �FROM sales FULL OUTERJOIN storesWHERE sales.storeID = stores.storeID;CREATE MATERIALIZED VIEW SSFullInfo ASSELECT �FROM SSInfo FULL OUTERJOIN InfoStoresWHERE SSInfo.state = InfoStores.state;Consider the database sizes shown in Table 1. We assume that the base relation sales has onebillion sales transactions, and the base relations stores, InfoStores, and items have 1,000, 100,and 10,000 tuples respectively. We illustrate the various maintenance approaches for the casewhen 10,000 tuples are inserted into the base relation sales. Table 1 shows the number of tupleschanged in the views, as a result of the insertion of 10,000 tuples into sales. The table also showsthe number of tuple accesses (reads and writes) incurred by di�erent maintenance techniques toupdate the materialized views. We have used the simple model of counting tuple accesses for sakeof convenience as orders of magnitude improvement in number of tuples computed and accessedtranslates directly into signi�cant improvement in number of disk accesses. In Section 7, we showthat the presented techniques are superior to previous techniques under a data warehouse costmodel. We use the names V1; V2; V3; V4; and V5 for the views SISales, CitySales, CategorySales,SSInfo, and SSFullInfo respectively. We assume that the relations stores, InfoStores, anditems are small enough to �t in main-memory.Previous TechniquesOf the previous approaches, only [Qua97] provides techniques to maintain general view ex-pressions involving aggregate operators. Prior works in [GMS93], [GL95], and [MQM97] consideraggregates, but in a very limited fashion. Gupta et al. in [GMS93] consider view expressionsinvolving aggregates, but their maintenance expressions assume that the intermediate aggregate4

Page 5: Incremental Maintenance of Aggregate and Outerjoin Expressions

Summary Table Number of Changes Tuple Reads and WritesTuples (No. of Tuples) Previous Work Our Approachsales 1,000,000,000 10,000stores 1,000 -InfoStores 100 -items 10,000 -V1 = SISales 1,000,000 600 610,000 ([Qua97]) 10,000V2 = CitySales 100 10 1,020 ([Qua97]) 1,020V3 = CategorySales 1,000 1,000 12,000 ([Qua97]) 12,000Total for V1; V2; V3 623,020 [Qua97] 23,020V4 = SSInfo 1,000,000,010 10,000 2,000,000,020 [GJM97] 11,00011,000 [GK98]V5 = SSFullInfo 1,000,000,020 10,000 10,100 [GJM97]/[GL95] 20,1001,000,020,110 [GK98]Total for V4 and V5 1,000,031,110 [GK98] 31,100Table 1: Bene�ts of propagating change tables (Materialized views are V2; V3, and V5.)subexpressions are materialized. Modifying the techniques presented in [GMS93] to accommodatenon-materialized aggregate subexpressions yields an approach that is conceptually similar to thatof [Qua97]. The works of [GL95] and [MQM97] are restricted to views that have at most oneaggregate operator as the last operator in the view expression.5 Also, group-by attributes are notallowed in [GL95], and [MQM97] maintains views on star schemas only. Thus, for the case ofgeneral view expressions involving aggregate operators, we can compare our techniques with thatof [Qua97] only.The work in [Qua97] is an extension of the techniques in [GL95] to include aggregate operators.[Qua97] computes maintenance expressions for general views by recursively computing insertionsand deletions for each of the subexpressions in the view expression in response to changes at the baserelations. In our example, the insertions to sales,4sales, result in insertions (4V1) and deletions(5V1) to the view V1 = SISales, which is an aggregate view over the base relation sales. Theexpressions that compute4V1 and5V1, as derived in [Qua97], are quite complex (see Appendix A).As V1 is not materialized, the maintenance expressions for V1 essentially recompute the aggregatevalues of the a�ected tuples in V1 from the base relation sales. Using the propagation equationsfrom [Qua97], one can propagate 4V1 and 5V1 upwards to obtain expressions for 5V2;4V2;5V3;and 4V3. Appendix A shows the derived maintenance expressions and the computation of numberof tuples accessed.Our TechniquesThe approach proposed in this article is the following. Instead of computing and propagatinginsertions and deletions beyond an aggregate node V1, we compute and propagate a change tablefor V1. A change table is a general form of summary-delta tables introduced in [MQM97]. We use2V to denote the change table of a view V . We show that propagation of change tables yields very5In some cases, view expressions can be rewritten so that aggregation is the last operator, but the rewritten queryhas worse query performance. 5

Page 6: Incremental Maintenance of Aggregate and Outerjoin Expressions

sales TablestoreID itemID date priceNY1 I1 1/3/95 110NY2 I2 7/3/95 200NY2 I2 3/1/95 280SJ1 I1 1/1/95 100 Insertions into sales (4sales)storeID itemID date priceNY1 I1 1/4/95 70NY1 I1 1/5/95 50NY2 I2 7/3/95 50NY3 I3 1/1/92 100MA1 I4 7/3/95 50stores TablestoreID city stateNY1 New York NYSJ1 San Jose CANY3 Bu�alo NYMA1 Boston MA items TableitemID category costI1 C1 50I1 C2 50I2 C3 80I2 C1 80 SISales Change Table (2V1)storeID itemID SumSISales NumSISalesNY1 I1 120 2NY2 I2 50 1CategorySales (V3) Viewcategory SumCaSales NumCaSalesC1 690 4C2 210 2 tU�C3 480 2 CategorySales Change Table (2V3)category SumCaSales NumCaSalesC1 170 3C2 120 2C3 50 1= Refreshed CategorySales (V3) Viewcategory SumCaSales NumCaSalesC1 860 7C2 330 4C3 530 3Figure 1: Computing Change tables and refreshing the CategorySales (V3) Viewe�cient and simple maintenance expressions for general view expressions involving aggregate andouterjoin operators. The change table cannot be simply inserted or deleted from the materializedview. Rather, the change table must be applied to the materialized view using a special \refresh"operator. In Section 4, we de�ne a refresh operator that is used to refresh views using the changetables. The symbol tU� is used to denote the refresh operator, where � and U are its parametersspecifying join conditions and update functions respectively.For our example, we start with computing the change table 2V1 that summarizes the netchanges to V1. For this �rst level of aggregates, the expression that computes 2V1 is similar to thatderived in [MQM97].2V1 = �storeID;itemID;SumSISales=sum(price);NumSISales=sum( count)(�storeID;price; count=1(�p(4sales)) ]�storeID; price= �price; count= �1(�p(5sales))); where p is (date > 1=1=95)Figure 1 presents an instance of the base relation sales and the table 4sales, which is the set ofinsertions into sales. For the given tables, Figure 1 also shows the computed table 2V1. Next wepropagate the change table 2V1 upwards to derive expressions for the change tables 2V2 and 2V3.2V2 = �city;SumCiSales=sum(SumSISales);NumCiSales=sum(NumSISales)(2V1 1 stores)2V3 = �category;SumCaSales=sum(SumSISales);NumCaSales=sum(NumSISales)(2V1 1 items)6

Page 7: Incremental Maintenance of Aggregate and Outerjoin Expressions

InfoStores Viewstate Area Population(in thousand sq. miles) (in millions)NY 100 25CA 150 35SSFullInfo (V5) ViewstoreID itemID date price st.storeID st.city st.state InSt.state InSt.area InSt.popNY1 I1 1/3/95 110 NY1 New York NY NY 100 25NY2 I2 7/3/95 200 NULL NULL NULL NULL NULL NULLNY2 I2 3/1/95 280 NULL NULL NULL NULL NULL NULLSJ1 I1 1/1/95 100 SJ1 San Jose CA CA 150 35NULL NULL NULL NULL NY3 Bu�alo NY NY 100 25NULL NULL NULL NULL MA1 Boston MA NULL NULL NULLSSFullInfo Change Table (2V5 = (5sales lo./ stores) lo./ InfoStores)storeID itemID date price st.storeID st.city st.state InSt.state InSt.area InSt.popNY1 I1 1/4/95 70 NY1 New York NY NY 100 25NY1 I1 1/5/95 50 NY1 New York NY NY 100 25NY2 I2 7/3/95 50 NULL NULL NULL NULL NULL NULLNY3 I3 1/1/92 100 NY3 Bu�alo NY NY 100 25MA1 I4 7/3/95 50 MA1 Boston MA NULL NULL NULLRefreshed SSFullInfo (V5)storeID itemID date price st.storeID st.city st.state InSt.state InSt.area InSt.popNY1 I1 1/3/95 110 NY1 New York NY NY 100 25NY2 I2 7/3/95 200 NULL NULL NULL NULL NULL NULLNY2 I2 3/1/95 280 NULL NULL NULL NULL NULL NULLSJ1 I1 1/1/95 100 SJ1 San Jose CA CA 150 35NY3 I3 1/1/92 100 NY3 Bu�alo NY NY 100 25NY1 I1 1/4/95 70 NY1 New York NY NY 100 25NY1 I1 1/5/95 50 NY1 New York NY NY 100 25NY2 I2 7/3/95 50 NULL NULL NULL NULL NULL NULLMA1 I4 7/3/95 50 MA1 Boston MA NULL NULL NULLFigure 2: Computing change table and refreshing the outerjoin view expression SSFullInfo (V5)Figure 1 shows the change table 2V3 for the given instance of the base table items. The changetable 2V2 can be similarly computed. The new propagated change tables are then used to refreshtheir respective materialized views V2 and V3 using the refresh equations below (disregard � and Ufor now). The details of the refresh equations are given in Example 3. Note that V1 does not needto be refreshed. V2 = V2 tU2�2 2V2; and V3 = V3 tU3�3 2V3We illustrate the refresh operation by showing how the CategorySales view (V3) is refreshed.Figure 1 shows the materialized table V3 = CategorySales for the given instance of base tables.For each tuple 2v3 in 2V3, we look for a match in V3 using the join condition V3.category =7

Page 8: Incremental Maintenance of Aggregate and Outerjoin Expressions

2V3.category (speci�ed in �3). The tuple 2v3 = <C1,170,3> in 2V3 matches with the tuple v3 =<C1,690,4> of V3. The tuple <C1,170,3> in 2V3 means that three more sales totaling $170 haveoccurred for C1 category. The total number of sales for C1 is now 7 for a total amount of $860.To re ect the change, the tuple v is updated to <C1,860,7> by adding together the correspondingaggregated attributes (speci�ed in the U3 parameter of the refresh operator).To compute the number of tuple accesses in Table 1, note that most of the computation is donein computing 2V1, which requires 10,000 tuple accesses to read 4sales. Given the small sizes of2V1; items; and stores, the rest of the computation can be done in main memory and hence, thetotal number of tuples accesses is 10,000 (to read 4sales) + 11,000 (to read items and sales) +2,020 (to refresh V2 and V3) = 23,020, showing that our technique is very e�cient in comparisonto previous approaches.OuterjoinsThe change-table technique can also be used for maintenance of view expressions involvingouterjoin operators. Outerjoin views are supported by SQL and are commonly used in practice,such as for data integration [GJM96]. We illustrate our technique for maintaining outerjoin viewsby deriving maintenance expressions for outerjoin views V4 and V5. The net changes to V4, inresponse to insertions 4sales into sales, can be succinctly summarized in a change table 2V4.The change table 2V4 is computed and then propagated up as follows.2V4 = 4sales lo./sales:storeID=stores:storeID stores2V5 = 2V4 lo./stores:state=InfoStores:state InfoStoresV5 = V5 tU5�5 2V5Recall that lo./ denotes the left-outerjoin operator. Figure 2 shows the materialized view V5 andthe change tables 2V4 and 2V5 for the database instance of Figure 1. The refresh of V5 proceeds asfollows. Each tuple in 2V5 is matched with tuples in V5 that have the same stores and InfoStoresattributes, but have all NULL's in the attributes of sales. This join condition used for matchingis speci�ed in �5. In our example of Figure 2, one of the matching pairs is (2v5; v5), where 2v5 =<NY3,I3,1/1/92,100,NY3,Bu�alo,NY>2 2V5 and v5 =<NULL,NULL,NULL,NULL,NY3,Bu�alo, NY>2 V5. The match results in an update of v5 to 2v5, according to the update speci�cations in U5.Similarly, the other matching pair <NULL,NULL,NULL,NULL,MA1,Boston,MA,NULL, NULL,NULL> 2V5, and <MA1,I4,7/3/95,50,MA1,Boston,MA, NULL,NULL,NULL> 2 2V5 is handled. The remainingunmatched tuples in 2V5 are inserted into the view V5. The refreshed view V5 is shown in Figure 2.Given the small sizes of stores, InfoStores, 5sales, and 2V4, the number of tuple accessesrequired to compute the change tables and refresh V5 is 10,000 (to read 4sales) + 1,000 (to readstores) + 100 (to read InfoStores) + 20,000 (to refresh V5).This is the �rst paper to address maintenance of general view expressions involving outerjoinoperators. Recently, in work done concurrently with ours, [GK98] also reports an algorithm tohandle view expressions involving outerjoins, extending previous work on maintenance of outerjoinviews in [GJM97]. [GK98] uses insertion and deletion sets to propagate changes through outerjoinoperators. Thus, insertions in sales results in insertions and deletions at SSInfo, which in turnresult in insertions and deletions at SSFullInfo. However, according to the change propagationequations in [GK98], in order to compute the insertions and deletions at SSFullInfo, we have tocompute the intermediate view SSInfo, thereby incurring more than a billion tuple accesses. 28

Page 9: Incremental Maintenance of Aggregate and Outerjoin Expressions

3 The Change-Table Technique for View MaintenanceIn this section, we explain the framework developed in [QW91, GL95] for deriving incremental viewmaintenance expressions and relate it to the change-table technique developed in this article.Let a database contain a set of relations R = fR1; R2; : : : ; Rng. A change transaction t isde�ned to contain for each relation Ri the expression Ri ( Ri �� 5Ri) ] 4Ri; where 5Ri isthe set of tuples to be deleted from Ri, and 4Ri is the set of tuples to be inserted into Ri.Let V be a bag-algebra expression de�ned on a subset of the relations in R. The refresh-expression New(V; t)6 is used to compute the new value of V . Gri�n and Libkin in [GL95] de�nethe expression New(V; t) to be:New(V; t) = (V �� 5(V; t)) ] 4(V; t):So, the goal in deriving view maintenance expressions for a view V is to derive two functions5(V; t) and 4(V; t) such that for any transaction t, the view V can be maintained by evaluating(V �� 5(V; t)) ] 4(V; t). In order to derive 5(V; t) and 4(V; t), [GL95] gives change propagationequations that show how deletions and insertions are propagated up through each of the relationaloperators. The work of [GL95] was extended to include aggregate operators by Quass in [Qua97].The change-table technique presented in this article can be thought of as introducing a newde�nition for New(V; t). The expression New(V; t) for view expressions involving aggregates andouterjoins is de�ned as New(V; t) = (V tU� 2(V; t));where 2(V; t) is called the change table, tU� is the refresh operator used to apply the net changesin a change table to its view, and (�; U) are the parameters of the refresh operator. The parameter� speci�es the join conditions on the basis of which the tuples from the change table and the vieware matched, and U speci�es the functions that are used to update the matched tuples in V .The new de�nition of New(V; t) is motivated from the following observation. In the case ofgeneral view expressions involving aggregate operators, it is usually more e�cient to propagatethe change tables beyond an aggregate operator, instead of propagating insertions and deletions.Propagation of a change table is particularly e�cient when the change table depends only onthe changes to the base relation (self-maintainability [GJM96]), while the insertions and deletionsdepend on the old value of the view. As we showed in the motivating example, if the aggregatenode is not materialized, the computation of insertions and deletions could be very expensive.The new de�nition of New(V; t) means that we need to de�ne a general refresh operator, and de-rive change propagation equations for propagating change tables through various relational, aggre-gate and outerjoin operators, to obtain a complete technique for e�cient incremental maintenanceof general view expressions. In the following section, we present a formal de�nition of the refreshoperator. In later sections, we derive change propagation equations for general view expressionsinvolving aggregate and outerjoin operators.4 The refresh OperatorIn this section, we give a formal treatment of the refresh operator. Given a materialized table Vand its change table 2V , the refresh is used to apply the changes represented in a change table.6[GL95] uses the notation pre(t; V ) instead. 9

Page 10: Incremental Maintenance of Aggregate and Outerjoin Expressions

The binary refresh operator is a generalization of the refresh algorithm used in [MQM97] and canbe implemented using the modify operation discussed in [QM97].We denote the refresh operator by tU� , where � is a pair of two mutually exclusive joinconditions and U is a list of update function speci�cations. The refresh operator takes twooperands, a view V to be updated and a corresponding change table (denoted by 2V ).Let V and 2V be views with the same attribute names A1; A2; : : : ; An. The subscript � as-sociated with the operator is a pair of join conditions J 1 and J 2. The superscript U is a listspecifying the attributes that may be changed by the change operator, along with the update func-tions that specify the exact nature of the change to each attribute being changed. More precisely,U = <(Ai1 ; f1); (Ai2; f2); : : : ; (Aik ; fk)>, where Ai1 ; : : : ; Aik are attributes of V and f1; : : : ; fk arefunctions. Unless otherwise speci�ed, we will assume that the functions in U are binary functions.In an expression V tU� 2V , each tuple 2v of 2V is checked for possible matches (due to J 1 orJ 2) with tuples in V . If a match is found due to the join condition J 1, then the correspondingmatching tuple v of V is updated as described in the next paragraph (using the speci�cations inthe update list U). If the match is due to J 2, the tuple v of V is deleted. The unmatched tuplesin 2V are inserted into V . The matching done is one-to-one in the sense that a tuple 2v 2 2Vmatches with at most one tuple in V and vice-versa. If 2v �nds more than one match in V , thenan arbitrary matching tuple from V is picked.Update of a tuple v of V matching with the tuple 2v of 2V is done based on the updatefunctions in U . For each pair (Aij ; fj) in U , the Aij attribute of v is changed to fj(v(Aij); 2v(Aij)),where v(X) and 2v(X) denote the values of the X attribute of v and 2v respectively. Each pair inU is used to change the matching tuple v.EXAMPLE 2 Consider the view CategorySales (V3) de�ned in Example 1 earlier. The CategorySalestable as de�ned in Example 1 computes the total sales for each category. In this example, we illus-trate the refresh operation by applying the changes summarized in a change table 2CategorySalesto its view CategorySales using the refresh operator.For this example, we consider the instance of the base table shown in Figure 1. Figure 1also shows the materialized table CategorySales for the given instance. In response to the in-sertion of the table 4sales into the base table sales, the change table 2CategorySales canbe computed and is shown in the �gure. The view CategorySales is refreshed using the expres-sion CategorySales tU3�3 2CategorySales, where the parameters, �3 = (J 1;J 2) and U3, of therefresh operator are de�ned as follows.� J 1 is (�category ^ ((CategorySales:NumCaSales + (2CategorySales):NumCaSales) 6= 0))� J 2 is (�category ^ ((CategorySales:NumCaSales + (2CategorySales):NumCaSales) = 0))� U3 = <(SumCaSales; f); (NumCaSales; f)>, where f(x; y) = x+ y for any x; y.Note that�category represents the predicate (CategorySales.category= (2 CategorySales).category)here. Now, we try to run the refresh operation on the change table 2CategorySales and the viewCategorySales of Figure 1.The �rst tuple 2v3 = <C1,170,3> of 2CategorySalesmatches with the tuple v3 = <C1,690,4>in CategorySales using the join condition J 1. The match results in update of the tuple <C1,690,4>in CategorySales according to the speci�cations in the update list U3. The attribute SumCaSalesof the tuple v3 is changed to 2v3(SumCaSales) + v3(SumCaSales) = 170 + 690 = 860 and the10

Page 11: Incremental Maintenance of Aggregate and Outerjoin Expressions

attribute NumCaSales is changed to 4 + 3 = 7. Similarly, the tuple <C2,210,2> 2 CategorySalesmatches with the tuple <C2,120,2> of 2CategorySales and is updated to <C2,330,4>. The tuple<C3,480,2> of CategorySales is also updated to <C3,530,3> accordingly.To illustrate deletion from the view CategorySales, let us assume that the change table2CategorySales contains a tuple b = <C2,-210,-2> as a result of a deletion of a couple of tuplesfrom the sales table. The tuple b = <C2,-210,-2> will match with the tuple v = <C2,210,2> inV3 using the join condition J 2 and the match will result in deletion of v from CategorySales. 2Implementation of the refresh operator One simple way to implement the refresh opera-tor is to use a nested loop algorithm, with the change table as the outer table, and the materializedview table as the inner table. The algorithm is shown in Appendix B. The nested loop algorithmis just one possible way to implement the refresh operator. In fact, Quass and Mumick [QM97]show that the refresh operation can be implemented more e�ciently by using existing outerjoinmethods inside the DBMS.5 Propagating Change Tables Generated at AggregatesIn this section, we will derive the algebraic equations to (1) generate change tables at an aggregationnode in a view expression, and (2) propagate these change tables through the relational, aggregateand outerjoin operators in a general view expression. We start with a few de�nitions.De�nition 1 (Aggregate-change table) A change table for a view expression involving aggre-gates is de�ned as an aggregate-change table7 if the change table either originated at an aggregateoperator or is a result of propagation of a change table that originated at an aggregate node, usingthe propagation equations we will present in Table 3. 2For example, the change tables, 2SISales, 2CitySales, and 2CategorySales, computed for theviews SISales, CitySales, and CategorySales respectively in Example 1 are aggregate-changetables.We will show that a special case of the general refresh operator is su�cient to apply the changessummarized in an aggregate-change table to its view. The special form yields very simple changepropagation equations for propagation of aggregate-change tables through relational, aggregate andouterjoin operators.We use the notation Attrs(') to represent the set of attributes referenced in '. Thus, Attrs(U)refers to the set of attributes speci�ed in the update list U and Attrs(�) represents Attrs(J 1) [Attrs(J 2), where J 1 and J 2 are the join conditions in � = (J 1;J 2).5.1 Generating the Aggregate-Change TableConsider a view V de�ned as an aggregation over a select-project-join (SPJ) expression. In thissection, we give a brief description of how an aggregate-change table is generated at V in response7The notions of aggregate-change table and outerjoin-change table (introduced later in Section 6) have been de�nedonly for simplying the presentation of the material in this article.11

Page 12: Incremental Maintenance of Aggregate and Outerjoin Expressions

Aggregate Change due to NCOUNT(*) -1SUM(expr) -exprMIN(expr) exprMAX(expr) exprTable 2: Table used to change the aggregate attributes in deletionsto the insertions and deletions at the base tables. The method is similar to that of generating\summary-delta table" for a \summary table" [MQM97].For the case when a view V is de�ned as an aggregation over an SPJ expression, the insertionsand deletions to the base tables can be propagated to V as a single aggregate-change table, whichwe denote by 2V . Without loss of generality, let us assume V to be �G;B=f(A)(R), where R is theSPJ subview, G is the set of group-by attributes, f is an aggregate function, and A is an attribute ofR. In the case of distributive aggregate functions, the aggregate-change table 2V can be computedfrom the insertions and deletions into R by using the same generalized projection as that used forde�ning the view V . More precisely, 2V can be computed as2V = �G;f(A);Count=sum( count)(�G;A; count=1(4R) ] �G;A=N(A); count=�1(5R));where the function N is suitably de�ned depending on the aggregate function f as shown in Table 2.For example, in the case of a sum aggregate the function N negates the attribute value passed.Once the aggregate-change table has been computed, the net changes, now in the aggregate-changetable, are then applied to the view using the refresh operator.If the aggregate function f is self-maintainable (i.e., the new value can be computed from theold value of the function and the changes to the base data), then the special form of refreshoperator de�ned in the next subsection is su�cient to apply the change table 2V to the view V .For the case of aggregate functions that are not self-maintainable, we need more complex functionsin the update list U of the refresh operator. For example, to handle deletions from a subview ofa simple MIN/MAX aggregate view, the update function needs to compute the new MIN/MAXvalue, if needed, by accessing the base relations (see [MQM97] for details). The change propagationequations and other formalisms presented in this article are independent of the complexity and inputparameters of update functions used in the update list U of refresh.5.2 Refresh Operator for Applying Aggregate-Change TablesFor the case of applying net changes computed in an aggregate-change table to its view, the refreshoperator required is a special case of the generic operator de�ned earlier in Section 4. We will usethe special characteristics of the refresh operator to derive simpler change propagation equations.Recall that the expression used to refresh a view V using its change table 2V is: V = V tU� 2V ,where � = (J 1;J 2) and U is the update list. In the case of the refresh operator used to applyaggregate-change tables, J 1 is (�G ^ :p1) and J 2 is (�G ^ p1), for some predicate p1 and a setof attributes G common to both V and 2V . As de�ned before, the notation �G here representsthe predicate Vg2G(V:g = 2V:g), as V and 2V are the left and right operands of the join operator.12

Page 13: Incremental Maintenance of Aggregate and Outerjoin Expressions

New VNo. V � = (J 1;J 2) Refresh Equation 2V ConditionsJ 1 is �G ^ :p1J 2 is �G ^ p11 �p(E1) �p(E1 tU� 2E1) V tU� �p(2E1) �p(2E1) Attrs(p) � G2 �A(E1) �A(E1 tU� 2E1) V tU� �A(2E1) �A(2E1) Attrs(�) � AV tU�1 (2E1 � E2)3 E1 � E2 (E1 tU� 2E1) � E2 �1 = (J 1 ^ �� ;J 2 ^ �� ) 2E1 � E2� = Attrs(E2)V tU�1 (2E1 1JE2)4 E1 1JE2 (E1 tU� 2E1) 1JE2 �1 = (J 1 ^ �� ;J 2 ^ �� ) 2E1 1JE2 Attrs(J) � G� = Attrs(E2)5 E1 ] E2 (E1 tU� 2E1) ] E2 ((V �� E2) tU� 2E1) ] E2) 2E1V tU3�3 �G0;f(A)(2E1) Ai 2 Attrs(U ); G0 � G,6 �G0;F (E1) �G0;F (E1 tU� 2E1) �3 = (�G0 ^ :p3; �G0 ^ p3) �G0;F (2E1) fi's are distributive,F = f1(A1); p3 is (V:Cnt+2V:Cnt) 6= 0 and the function: : : ; fk(Ak). U3 = <(A1; f1); : : : ; (Ak; fk)> in U for Ai is fi.V tU�1 (2E1 lo./J E2)7 E1 fo./JE2 (E1 tU� 2E1) fo./ JE2 �1 = (J 1 ^ �� ;J 2 ^ �� ) (2E1 lo./JE2) Attrs(J) � G� = Attrs(E1)Table 3: Change propagation equations for propagating aggregate-change tablesAlso, the set of attributes G is disjoint from the set of attributes, Attrs(U), that are being updated.The predicate p speci�es when the matching tuple in V is to be deleted, i.e., when the value of theattribute that stores the number of deriving base tuples becomes zero. The above characteristicsare summarized in the de�nition of an aggregate-refresh operator below.De�nition 2 (Aggregate-refresh Operator) A refresh operator tU� , where � = (J 1;J 2), issaid to be an aggregate-refresh operator if, for some predicate p1 and a set of attributes Gcommon to both the view and its aggregate-change table,(1) The join conditions J 1 and J 2 can be represented as:� J 1 : (�G ^ :p1)� J 2 : (�G ^ p1), and(2) G \Attrs(U) = �. 2The characteristics of the aggregate-refresh operator are used to derive simple change prop-agation equations for propagating aggregate-change tables.5.3 Change Propagation EquationsFor the purposes of change propagation equations shown in Table 3, we assume that an aggregate-change table has been generated (as shown in Section 5.1) at the �rst aggregate operator in a13

Page 14: Incremental Maintenance of Aggregate and Outerjoin Expressions

view expression, in response to insertions and/or deletions at a base relation.8 Table 3 giveschange propagation equations for propagating (already generated) aggregate-change tables throughrelational, aggregate and outerjoin operators. In Theorem 1, we prove the correctness of the changepropagation equations of Table 3.Each row in the table considers propagation of an aggregate-change table through a relational,aggregate, or outerjoin operator. The �rst column gives the equation number used for later referencein examples. The second column in the table gives the outermost operator used in the de�nitionof V . The third column expresses the new expression for V due to a change occuring at oneof the subexpressions E1 of V . The third column is the same as the second except that thesubexpression E1 is replaced by (E1 tU� 2E1), where 2E1 is an aggregate-change table and tU�is an aggregate-change operator. We assume � to be (J 1;J 2); where J 1 is �G ^ :p1 and J 2is �G ^ p1, for some predicate p1 and a set of attributes G in E1. The fourth column gives therefresh equation New(V; t) that is used to refresh the view V . Theorem 1 proves that the refreshequation given in the fourth column is equivalent to the expression in the third column for eachrow of Table 3. Theorem 1 also proves that the refresh operator used in the refresh equationsof fourth column is an aggregate-refresh operator. The �fth column gives the expression forthe propagated aggregate-change table 2V , which can be derived from the refresh equation ofthe fourth column. Finally, the last column of the table states the conditions under which theequivalence of the fourth column and third column expressions holds, i.e., conditions under whichthe change propagation can be done. If the condition is not satis�ed, then the refresh equationcannot be used to propagate the aggregate-change table. We will show later how to handle changesat an operator node when the conditions in the sixth column are not satis�ed.The �rst row of the table depicts the case of a selection view V = �p(E1), where E1 is asubexpression. In this case, the expression for the propagated aggregate-change table is: 2V =�p(2E1), as shown in the �fth column. For the case of selection view, the condition required for thechange propagation is Attrs(p) � G. In other words, an aggregate-change table can be propagatedthrough a selection operator if the selection condition is de�ned over the G attributes, which arethe attributes that are not being updated by the update functions of the refresh operator.The second row depicts the case of projection on a set of attributes A. The condition impliesthat an aggregate change-table can be propagated through a duplicate-preserving projection onlyif all the attributes used in the join conditions of � are included in the projection list A.The third row considers the case of a cross product operation. In this case, the parameter � ofthe refresh operator is changed to also include the condition �Attrs(E2) in the join conditions J 1and J 2. No conditions are speci�ed in the �fth column meaning that an aggregate-change tablecan always be propagated through a cross product operator.The fourth row depicting the case of a join operation follows from the combination of previouscases of selection and cross product.The �fth row considers propagation of an aggregate-change table through the bag union oper-ator. The expression for 2V in the �fth column gives only the change table (2V = 2E1). Asthe refresh equation in the fourth column shows, the refresh in this case is more complex than8When two or more base relations are updated simultaneously (or a relation appears more than once in a viewexpression), we handle the updates in a sequence. 14

Page 15: Incremental Maintenance of Aggregate and Outerjoin Expressions

simply applying the change table. We need to �rst apply a set of deletions (5V = E2), followed byrefreshing the result with the change table (2V = 2E1), followed by inserting the set (4V = E2)into the result. One can derive a very e�cient refresh equation for the case of bag union operator,if the tuples in V are \tagged" L/R depending on whether they come from the left operand E1 orthe right operand E2. We omit the trivial details here.The sixth row depicts the case of propagation through a generalized projection (aggregate) op-erator. For the purposes of aggregate-change tables, we assume that any subexpression involving anaggregate operator stores with each tuple a count of the number of deriving base tuples. This countis stored in a general attribute which we will call the count attribute. For example, NumCiSalesand NumCaSales are count attributes in the views CitySales and CategorySales of Example 1.After propagation through the aggregate operator, the join conditions and the update speci�ca-tions of the aggregate-refresh operator change as shown in the fourth column. The attributenamed Cnt, used in de�ning p3 in the join conditions of �3 represents the count attribute of V .The condition in the �fth column says that each aggregate function fi is distributive, the set ofgroup-by attributes G0 is a subset of G, and the update list U used to change E1 contains (Ai; fi)for all 1 � i � k. Note that the duplicate-elimination operation is a special case of generalizedprojection, and is covered by the sixth row.The seventh row gives the refresh equation for the case of propagating an aggregate-change tablethrough a full outerjoin operation. In this case, the refresh operator speci�cations change as inthe case of cross product. For simplicity, we have assumed that the aggregate-change table 2E1doesn't result in any deletions from E1. Otherwise, a more extended refresh operator is requiredas discussed in Section C.Note that we do not give any change propagation equation for the case of monus because monusis a singularity point (see paragraph below).Singularity Points We call the operator nodes in a view expression tree, where none of therefresh equations in Table 3 apply, as singularity points. Aggregate-change tables cannot be prop-agated through singularity points. For example, a selection on the result of an aggregate functionis a singularity point, because it will not satisfy the condition Attr(p) � G given in the �rst rowof Table 3.Consider a view V and a singularity point V1, which is a subexpression of V , in the expressiontree of V . As the changes to V1 cannot be summarized into a change table, we instead computeinsertions (4V1) and deletions (5V1) into V1 and propagate the insertions and deletions beyondV1. The tables 4V1 and 5V1 can be easily computed from the change table of its descendant inthe expression tree. The computed insertions and deletions at a singularity point can then bepropagated further upwards in the expression tree using techniques presented in this article (asthey may result in change tables further on). Hence, presence of singularity points in an expressiontree doesn't preclude application of our change-table techniques for incremental maintenance.Theorem 1 Assume that the refresh operator used in the expression of the third column in Table 3is an aggregate-refresh operator. Then,(1) the change propagation equations given in Table 3 for propagation of aggregate-change tablesare correct, i.e., for each row, the expression in the third column is equivalent to the refresh equation15

Page 16: Incremental Maintenance of Aggregate and Outerjoin Expressions

in the fourth column, and(2) the refresh operator derived in the refresh equation (column 4) is an aggregate-refreshoperator as well. 2EXAMPLE 3 In this example, we illustrate the techniques developed in this section on the viewsof the motivating Example 1. Recall from Example 1 the de�nitions of V1 (SISales); V2 (CitySales);and V3 (CategorySales). Thus, we haveV1 = �storeID;itemID;SumSISales=sum(price);NumSISales=count(�)(�date>1=1=95sales)V 02 = V1 1 storesV2 = �city;SumCiSales=sum(SumSISales);NumCiSales=sum(NumSISales)(V 02)V 03 = V1 1 itemsV3 = �category;SumCaSales=sum(SumSISales);NumCaSales=sum(NumSISales)(V 03)where the (virtual) views V 02 and V 03 have been added for better illustration of how the aggregate-change tables propagate. We use the change propagation equations of Table 3 to derive the main-tenance expressions for V2 and V3, in response to changes in sales, as follows.2V1 = �storeID;itemID;SumSISales=sum(price);NumSISales=sum( count)(�storeID;price; count=1(�date>1=1=954sales)] �storeID; price= �price; count= �1(�date>1=1=955sales)) [From Section 5:1]V1 = V1 tU�1 (2V1); where �1 is ( �fstoreID;itemIDg ^ :p; �fstoreID;itemIDg ^ p)2V 02 = 2V1 1 storesV 02 = V 02 tU�12 2V 02; [From (4) in Table 3]where �12 is ( �fstoreID;itemIDg [ Attrs(stores) ^ :p; �fstoreID;itemIDg [ Attrs(stores) ^ p)2V2 = �city;SumCiSales=sum(SumSISales);NumCiSales=sum(NumSISales)(2V 02)V2 = V2 tU�2 2V2; where �2 is ( �city ^ :p; �city ^ p) [From (6) in Table 3]2V 03 = 2V1 1 itemsV 03 = V 03 tU�13 2V 03; [From (4) in Table 3]where �13 is ( �fstoreID;itemIDg [ Attrs(items) ^ :p; �fstoreID;itemIDg [ Attrs(items) ^ p)2V3 = �category;SumCaSales=sum(SumSISales);NumCaSales=sum(NumSISales)(2V 03)V3 = V3 tU�3 2V3; where �3 is ( �category ^ :p; �category ^ p) [From (6) in Table 3]In all the above equations, U is of the form<(SUM; f); (COUNT; f)> and p is of the form ((LHS:COUNT+RHS:COUNT) = 0),9 where SUM is the aggregated attribute (SumSISales, SumCiSales, or Sum-CaSales) in the corresponding view, COUNT is the count attribute (NumSISales, NumCiSales, orNumCaSales) depending on the view, and f(x; y) = x+ y for all x; y.As shown in Example 1, the above derived maintenance expressions for V2 and V3 are verye�cient compared to the expressions derived by previous approaches. 29Recall that LHS and RHS refer to the left and right operands of the join operation where p occurs.16

Page 17: Incremental Maintenance of Aggregate and Outerjoin Expressions

6 Maintaining Outerjoin Views E�cientlyIn this section, we show how our change-table techniques can be used to derive e�cient and simplealgebraic expressions for maintenance of view expressions involving outerjoin operators. Outerjoinis supported in SQL. Further, outerjoins have recently gained importance because data from mul-tiple distributed databases can be integrated by means of outerjoin views [GJM96, GM95, RU96].Outerjoins are also extensively used in object-relational systems [BW89, BW90, BPP+93].De�nition 3 (Outerjoin-change table) A change table for a view involving outerjoin operatorsis de�ned as an outerjoin-change table if the change table was either generated at an outerjoinoperator or is a result of propagation of an outerjoin-change table, using the propagation equationswe will derive for propagating outerjoin-change tables. 2For example, the change tables, 2SSInfo and 2SSFullInfo, computed for the views SSInfo andSSFullInfo respectively in Example 1 are outerjoin-change tables.We start by showing how the insertions and deletions into an outerjoin view (R fo./J S), inresponse to insertions into the base table R, can be summarized into an outerjoin-change table.The net changes represented in the outerjoin-change table are then applied to the outerjoin viewusing the refresh operator. Computation of an outerjoin-change table at an outerjoin view inresponse to deletions from a base table reqeires a more general refresh operator and is brie ydiscussed in Appendix C.6.1 Generating Outerjoin-Change Table At An Outerjoin NodeGiven tables R(A1; A2; : : : ; An) and S(B1; B2; : : : ; Bm), consider an outerjoin view V (A1; : : : ; An;B1; : : : ; Bm) = R fo./J S, where J is an equi-join condition. Insertions into R, represented by4R, can result in some insertions and deletions into view V . We summarize the set of insertionsand deletions to V in an outerjoin-change table 2V de�ned as 2V = 4R lo./J S. Note that thetables 2V and V have the same schema and attribute names. We show that with the followingspeci�cation of the refresh operator, the net changes in the outerjoin-change table 2V can beapplied to the view V to obtain the correctly refreshed V . The speci�cations, � = (J 1;J 2) and U ,of the refresh operator used to apply 2V to V are de�ned as follows.� J 1 is (�G ^ p), where G = Attrs(S) and p = (V1�j�n(V:Aj = NULL)).� J 2 is FALSE.� The update list U is <(A1; f); (A2; f); : : : ; (An; f)>, where f(x; y) = y for all x; y.Theorem 2 Consider the view V = R fo./J S and the outerjoin-change table 2V = 4R lo./J S.For the above de�nition of the refresh operator speci�cations of � = (J 1;J 2) and U , the followingholds: (R ] 4R) fo./J S = (R fo./J S) tU� (2V ) 26.2 Propagating Outerjoin-Change TablesOnly a special form of the generic refresh operator, which we call an outerjoin-refresh operator,is required to refresh a view using its outerjoin-change table.17

Page 18: Incremental Maintenance of Aggregate and Outerjoin Expressions

De�nition 4 (Outerjoin-refresh operator) Let fA1; A2; : : : ; An; B1; B2; : : : ; Bmg be the set ofattributes in V and its outerjoin-change table 2V . A refresh operator tU� used to apply theouterjoin-change table 2V to its view V is said to be an outerjoin-refresh operator if:� the join condition J 1 is (�G ^ p), where G = fB1; B2 : : : ; Bmg and p is a predicate on theattributes (LHS:A1; LHS:A2; : : : ; LHS:An).� the join condition J 2 is FALSE, and� the update list U is <(A1; f); (A2; f); : : : ; (An; f)>, where f(x; y) = y for all x; y. Note that(Attrs(U) \ G) = � and (Attrs(U)[ G) = Attrs(V ). 2The refresh equations given in Table 3 correctly propagate an outerjoin-change table as well,except for the case of propagation through the outerjoin operator, for which we derive a di�erentequation below.Consider a view V = E1 fo./J E2, where E1 and E2 are general view expressions. Suppose thatthe expression E1 is refreshed to E1 tU� 2E1 using its outerjoin-change table 2E1, where the refreshoperator tU� is an outerjoin-refresh operator. Let � be (�G ^ p; FALSE), where p is a predicate,and G is a set of attributes common to E1 and 2E1. The following row, which replaces the row (7)in Table 3, shows how to propagate an outerjoin-change table 2E1 through the outerjoin operator.7b E1 fo./JE2 (E1 tU� 2E1) fo./JE2 V tU1�1 (2E1 lo./JE2) (2E1 lo./JE2) Attrs(J) � GAs already mentioned, � is (�G ^ p; FALSE) in the row above. Also,� �1 = (J 1;J 2), where J 1 is �Attrs(E2) ^ ((p ^ �G) _ (Ve12Attrs(E1)LHS:e1 = NULL));J 2 isFALSE, and� U1 = <(A1; f); (A2; f); : : : ; (Ak; f)>; where fA1;A2; : : : ;Akg = Attrs(E1) and f(x; y) = yfor all x; y:Theorem 3 Assume that the refresh operator used in the expression of the third column in Table 3is an outerjoin-refresh operator.(1) The change propagation equations given in Table 3, with the following two changes, correctlypropagate outerjoin-change tables.� Disregard the condition in column 6 of the �rst row (selection view)� Replace the seventh row by row (7b) given above(2) The refresh operator derived in each of the refresh equations (column 4) is also anouterjoin-refresh operator. 2EXAMPLE 4 Consider a view V = �A;B;F=sum(D);H=sum(E);Num=Count(�)(�A<5((R fo./C=DS) 1 T )),where R(A;B;C); S(D;E), and T (A;B; L) are base relations, and 1 is the natural join operation,i.e., a join with the join condition (�fA;Bg). Recall that for computing the SUM aggregates, theattribute value of NULL is taken as 0, provided at least one tuple has a non-NULL value.For clarity of presentation, let us assume V1 = R fo./C=DS; V2 = V1 1 T; V3 = �A>5(V2): Letus de�ne a predicate p as ((LHS:D = NULL) ^ (LHS:E = NULL)), a predicate q as ((LHS:Num+RHS:Num) = 0), an update list U1 as <(F; f); (H; f)>, and U as <(D; g); (E; g); (Num; g)>.Here, f(x; y) = y for all x; y, g(x; y) = x+y for all x; y 6= NULL, and g(NULL; y) = y. We illustrate ourtechniques of maintaining views involving outerjoin operators by deriving maintenance expressions18

Page 19: Incremental Maintenance of Aggregate and Outerjoin Expressions

for V in response to insertions, 4S, into S. The equation used for computing 2V1 is similar to thatderived in Theorem 2. In this case, we have insertions into S and hence, we use a right outerjoinoperation instead.2V1 = (R ro./C=D 4S)V1 = V1 tU1�1 2V1; where �1 is ( �Attrs(R) ^ p; FALSE) [From Theorem 2]2V2 = 2V1 1 TV2 = V2 tU1�2 2V2;where �2 is ( �Attrs(R) [ Attrs(T ) ^ p; FALSE) [From (4) in Table 3]2V3 = �A>5(2V2)V3 = V3 tU1�2 2V3 [From (1) in Table 3]2V = �A;B;F=Sum(D);H=Sum(E);Num=Count(�)(2V3)V = V tU� 2V;where � is ( �fA;Bg ^ :q; �fA;Bg ^ q) [From (4) in Table 3] 27 Optimality IssuesIn this section, we discuss the optimality of the incremental maintenance algorithm based on ourchange-table techniques. For simplicity and concreteness, we consider the simplistic cost modelwherein the cost incurred by a maintenance algorithm in maintaining a view in response to somechanges at the base relations is the number of sources (base relations) queried. The above costmodel is reasonable in a data warehouse model where the dominant cost is the cost incurred inquerying the sources.Let us consider change-table techniques presented in this article for incremental maintenanceof general view expressions along with the following two minor optimizations:� In computing a view, we tag the tuples in the result of a union operator with L/R depending onwhether the tuple comes from the left operand or the right operand. The above improvementmakes propagation of an aggregate and outerjoin change-table through a union operator verye�cient. In essense, the refresh equation of the �fth row in Table 3 will not involve E1 or E2.� When computing the refresh equation, the view contents are used, whenever possible, tocompute a required subexpression using minimum number of source queries. For e.g., considerV = R � (S � T ). In response to insertions into R, to compute changes at V , we need toquery (S � T ) according to the propagation equations in Table 3. Instead, we query R andcompute (S � T ) using the value of V , minimizing the number of source queries.Incorporating the above two improvements into our change-table techniques, it is not di�cultto show the following.Theorem 4 Given an expression tree of a view V , the change-table technique queries the minimumnumber of sources required in order to compute changes at each node in the expression tree. 2We de�ne an algorithm to be a change-propagating algorithm if it computes changes (in someform) at each node in the given expression tree of the view. The above theorem shows that our19

Page 20: Incremental Maintenance of Aggregate and Outerjoin Expressions

change-table technique is better than any change-propagating maintenance algorithm for a givenexpression tree (even in presence of singularity points). Though, its not necessary for an incre-mental maintenance algorithm to be change-propagating, all the previously proposed maintenancealgorithms fall in that category.An interesting open problem is to design a provably optimal incremental maintenance algorithmunder the above cost model for maintenance of general view expressions (without restricting our-selves to the class of change-propagating algorithms). We have focussed our recent work on adaptingthe maintenance algorithm based on our change-table techniques to obtain an optimal approach.We conjecture that the change-table technique with some minor improvements/optimizations canbe translated into an optimal incremental maintenance algorithm under the above cost model.8 Related WorkA large body of work exists describing di�erent algorithms for incrementally maintaining mate-rialized views [BLT86, RK86, Han87, BCL89, CW91, QW91, GMS93, GLT94, GL95, CGL+96,GJM97, Qua97, GK98]. Each work applies to di�erent classes of views and has various advantagesand disadvantages. Below, we discuss some of the above algorithms.Qian and Wiederhold in [QW91] present a technique (later corrected in [GLT94]) to propagatesets of insertions and deletions through relational algebra operators without duplicates. The tech-niques in [QW91] were extended to bag algebra by Gri�n and Libkin in [GL95]. While [QW91]did not deal with aggregates at all, [GL95] did consider aggregates when they are applied as thelast operator in an expression, and when there are no groupby columns. The work of [GL95] wasfurther extended by Quass [Qua97] to include general expressions involving aggregation. However,as illustrated in Section 1, the technique of [Qua97] works with bags of insertions and deletions,and is thereby less e�cient, and more complex, than the change table technique presented in thispaper. None of [QW91, GL95, Qua97] can deal with outerjoins.Gupta et al. [GMS93] also present algorithms for incrementally maintaining views with dupli-cates. Aggregation is considered only for the case where a view is materialized at each aggregationnode. Outerjoins are not considered in [GMS93]. Mumick et al. in [MQM97] give an algorithm fore�ciently maintaining a set of summary tables. A summary table is the result of applying a singleaggregation over an SPJ expression over star schema tables in a data warehouse. [MQM97] doesnot consider outerjoins, or general view expressions involving aggregate operators.Gupta et al. in [GJM97] present maintenance and self-maintenance algorithms to compute theincremental changes to a materialized outerjoin view R fo./ S, where R and S are base tables.The algorithms presented in [GJM97] are procedural rather than algebraic, and do not apply togeneral view expressions containing outerjoin operators. In recent work done concurrently withours, Gri�n and Kumar in [GK98] extend the techniques of [GJM97] by deriving propagationequations through outerjoin operators. As [GK98] propagates changes in the form of insertions anddeletions, their incremental algorithm is less e�cient than our change-table techniques for generalview expressions, as brie y illustrated in our motivating example.The problem of incremental view maintenance is closely related to the problem of self-maintainabilityof views [GJM94, QGMW96, GJM97]. A view is de�ned as self-maintainable with respect to certain20

Page 21: Incremental Maintenance of Aggregate and Outerjoin Expressions

kinds of changes if the view can be updated using the old view value and the changes, without access-ing any base relations. The change-table technique presented in this articles, in most cases, derivesmaintenance expressions that do not refer to the base tables, even when such self-maintenanceexpressions were not possible using only insertions and deletions as types of updates. Hence, thetechniques presented in this article help in deriving e�cient self-maintenance expressions.9 ConclusionsIn this article, we have developed a change-table technique for incremental maintenance of generalview expressions involving aggregate and outerjoin operators. To the best of our knowledge, this isthe �rst paper to present algebraic expressions for maintaining general views involving aggregateand outerjoin operators.Traditional maintenance techniques [QW91, GL95, Qua97] propagate insertions and deletionsfrom the base relations to the view through each of its operators. In contrast, we compute changetables at an aggregate or outerjoin operator, and use change propagation equations to propagatethe change tables through the relational, aggregate and outerjoin operators. We show that thechanges represented in change tables can be applied to its corresponding materialized view usingan appropriately de�ned refresh operator. The resulting maintenance expressions for general viewexpressions are simple and very e�cient compared to previous techniques.The change-table technique presents a new paradigm for view maintenance using change tables.The maintenance expressions derived by the change-table techniques are usually self-maintenanceexpressions, as they usually refer only to the view and the changes to the base tables, minimizingthe number of queries to the base tables. Such a paradigm is likely to encourage research intodeveloping more e�cient maintenance and self-maintenance expressions than are possible using theinsertion/deletion paradigm.Our future work focuses on translating our techniques into a provably optimal incrementalmaintenance approach under some reasonable cost model.AcknowledgementsWe would like to thank Prof. Je� Ullman for very helpful suggestions on improving the presen-tation of this article.References[BCL89] J. Blakeley, N. Coburn, and P. Larson. Updating derived relations: Detecting irrelevant andautonomously computable updates. ACM Transactions on Database Systems, 14(3):369{400,September 1989.[BLT86] J. Blakeley, P. Larson, and F. Tompa. E�ciently Updating Materialized Views. In CarloZaniolo, editor, Proceedings of ACM SIGMOD 1986 International Conference on Managementof Data, pages 61{71, Washington, D.C., May 28-30 1986.[BM90] J.A. Blakeley and N.L. Martin. Join index, materialized view, and hybrid hash join: A per-formance analysis. In Proceedings of the Sixth IEEE International Conference on Data Engi-neering, pages 256{263, Los Angeles, CA, February 5-9 1990.21

Page 22: Incremental Maintenance of Aggregate and Outerjoin Expressions

[BPP+93] B.Mitschang, H. Pirahesh, P. Pistor, B. Lindsay, and N. Sudkamp. Sql/xnf - processing compos-ite objects as abstractions over relational data. In Proceedings of the Ninth IEEE InternationalConference on Data Engineering, Vienna, Austria, April 1993.[BW89] T. Barsalou and G. Wiederhold. Knowledge based mapping of relations into objects. InProceeding of Computer Aided Design, 1989.[BW90] B.S.Lee and G. Wiederhold. Outer joins and �lters for instantiating objects from relationalsdatabases through views. Technical report, CIFE - Stanford University, 1990.[CGL+96] L. Colby, T. Gri�n, L. Libkin, I. Mumick, and H. Trickey. Algorithms for deferred viewmaintenance. In Proceedings of ACM SIGMOD 1996 International Conference on Managementof Data, pages 469{480, 1996.[CKL+97] L.S. Colby, A. Kawaguchi, D.F. Lieuwen, I.S. Mumick, and K.A. Ross. Supporting multipleview maintenance policies. In Joan M. Peckman, editor, Proceedings of ACM SIGMOD 1996International Conference on Management of Data, pages 256{263, Tucson, Arizona, June 1997.[CW91] S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In Guy M.Lohman, Amilcar Sernadas, and Rafael Camps, editors, Proceedings of the Seventeenth Inter-national Conference on Very Large Databases, pages 108{119, Barcelona, Spain, September3-6 1991.[GHQ95] A. Gupta, V. Harinarayan, and D. Quass. Generalized projections: A powerful approach toaggregation. In Proceedings of the 21st International Conference on Very Large Databases,Zurich, Switzerland, September 11-15 1995.[GJM94] A. Gupta, H. V. Jagadish, and I.S. Mumick. Data integration using self-maintainable views.Technical Memorandum 113880-941101-32, AT&T Bell Laboratories, November 1994.[GJM96] A. Gupta, H. Jagadish, and I. S. Mumick. Data integration using self-maintainable views. InProceedings of the Fifth International Conference on Extending Database Technology, pages140{144, Avignon, France, March 1996. Industrial Session.[GJM97] A. Gupta, H.V. Jagadish, and I.S. Mumick. Maintenance and self-maintenance of outerjoinviews. In Proceedings of the NGITS, Tel Aviv, Israel, June 1997.[GK98] T. Gri�n and B. Kumar. Algebraic change propagation for semijoin and outerjoin queries.SIGMOD Record, 27(3), September 1998.[GL95] T. Gri�n and L. Libkin. Incremental maintenance of views with duplicates. In Proceedingsof the ACM SIGMOD International Conference on Management of Data, pages 316{327, SanJose, California, May 1995.[GLT94] T. Gri�n, L. Libkin, and H. Trickey. A correction to \incremental recomputation of activerelational expressions" by Qian and Wiederhold. Technical report, AT&T Bell Laboratories,Murray Hill NJ, 1994.[GM95] A. Gupta and I. Mumick. Maintenance of materialized views: Problems, techniques, andapplica tions. Special Issue on Materialized Views and Data Warehousing, IEEE Data Engi-neering Bulletin, 18(2):3{19, June 1995.[GM99] H. Gupta and I.S. Mumick. Incremental maintenance of aggregate and outerjoin expressions.Technical report, Stanford University, 1999.[GMS93] A. Gupta, I. Mumick, and V. Subrahmanian. Maintaining views incrementally. In Proceedingsof ACM SIGMOD 1993 International Conference on Management of Data, Washington, DC,May 26-28 1993. 22

Page 23: Incremental Maintenance of Aggregate and Outerjoin Expressions

[Han87] E. Hanson. A performance analysis of view materialization strategies. In Umeshwar Dayaland Irv Traiger, editors, Proceedings of ACM SIGMOD 1987 International Conference onManagement of Data, pages 440{453, San Francisco, CA, May 27-29 1987.[MQM97] I. Mumick, D. Quass, and B. Mumick. Maintenance of data cubes and summary tables in awarehouse. In Proceedings of the ACM SIGMOD International Conference of Mangement ofData, Tucson, Arizona, June 1997.[QGMW96] D. Quass, A. Gupta, I. Mumick, and J. Widom. Making views self-maintainable for datawarehousing. In Proceedings of the Fifth International Conference on Parallel and DistributedInformation Systems (PDIS), 1996.[QM97] D. Quass and I. Mumick. Optimizing the refresh of materialized views. UnpublishedManuscript, 1997.[Qua97] D. Quass. Materialized Views in Data Warehouses. PhD thesis, Stanford University, Depart-ment of Computer Science, 1997. Chapter 4. Preliminary version appears as MaintenanceExpressions for Views with Aggregation in the ACM Workshop on Materialized Views, 1996.[QW91] X. Qian and G. Wiederhold. Incremental recomputation of active relational expressions. IEEETransactions on Knowledge and Data Engineering, pages 337{341, 1991.[RK86] N. Roussopoulos and H. Kang. Principles and techniques in the design of ADMS�. IEEEComputer, pages 19{25, December 1986.[RU96] A. Rajaraman and J. Ullman. Integrating information by outerjoins and full disjunctions. InProceedings of the Fifteenth Symposium on Principles of Database Systems (PODS), Montreal,Canada, June 1996.A Maintenance expressions derived using techniques in [Qua97].For our purposes, its not important to understand the maintenance expressions given below andthey are given here primarily to show their complexity. We have used 1 to denote a naturaljoin operation, 1G to denote an equi-join operation on a set of attributes G, >< to denote ananti-semijoin operation, and �A to denote the generalized projection symbol, which represent thegroupby operation of SQL as described in [GHQ95] and Section 1.1. Also, l:a and r:a refer to theattribute a of the left and right operands respectively.Let A1 = fstoreID; itemID; SumSISales;NumSISalesgLet A1;ins = fstoreID; itemID; price; count = 1gLet A1;del = fstoreID; itemID; price = �price; count = �1gLet A1;� = fstoreID; itemID; SumSISales = sum(price);NumSISales = sum( count)g�sales = �A1;ins(�(date>1=1=95)(4sales)) ] �A1;del(�(date>1=1=95)(5sales))5(V1) = �r:aja2A1(�A1;� (�sales) 1storeID;itemID V1)4(V1) = �f(r:a+l:a)ja2A1g( �r:NumSISales+l:NumSISales>0(�A1;� (�sales) 1storeID;itemID V1))] �l:NumSISales>0 (�A1;� (�sales) ><storeID;itemID V1)23

Page 24: Incremental Maintenance of Aggregate and Outerjoin Expressions

Let A2 = fcity; SumCiSales;NumCiSalesgLet A2;ins = fcity; SumSISales;NumSISalesgLet A2;del = fcity; SumSISales = �SumSISales;NumSISales = �NumSISalesgLet A2;� = fcity; SumCiSales = sum(SumSISales);NumCiSales = sum(NumSISales)g�2 = �A2;ins(4V1 1 stores) ] �A2;del(5V1 1 stores)5(V2) = �r:aja2A2( �A2;� (�2) 1storeID;itemID V2)4(V2) = �f(r:a+l:a)ja2A2g( �r:NumCiSales+l:NumCiSales>0(�A2;� (�2) 1city V2))] �l:NumCiSales>0 (�A2;� (�2) ><city V2)Let A3 = fcategory; SumCaSales;NumCaSalesgLet A3;ins = fcategory; SumSISales;NumSISalesgLet A3;del = fcategory; SumSISales = �SumSISales;NumSISales = �NumSISalesgLet A3;� = fcategory; SumCaSales = sum(SumSISales);NumCaSales = sum(NumSISales)g�3 = �A3;ins(4V1 1 items) ] �A3;del(5V1 1 items)5(V3) = �3:aja2A3( �A3;� (�3) 1storeID;itemID V3)4(V3) = �f(r:a+l:a)ja2A3g( �r:NumCaSales+l:NumCaSales>0(�A3;� (�3) 1category V3))] �l:NumCaSales>0 (�A3;� (�3) ><category V3)Computing Tuple Accesses As V1 is not materialized, computing tuples of 4V1 requires thatfor each tuple in �A1;� (�sales), we must look up all tuples of sales that have the same storeID anditemID values. Given the database sizes of Table 1, assume that each tuple of V1 is derived from1,000 tuples of sales on an average. Thus, computing 600 tuples of 4V1 requires 600,000 tupleaccesses. We need 11,000 accesses to read the base relations stores and items into main-memory.Even assuming that rest of the computation can be done in main memory, the total number oftuples accesses to refresh the views V2 and V3 is 10; 000+ 600; 000 + 11; 000 + 2; 020, where 2,020tuple accesses are due to the �nal tuple updates in V2 and V3. Note that each update requires aread and a write access.B Refresh AlgorithmAlgorithm 1 Refresh AlgorithmInputTable V (A1; A2; : : : ; An). %Change Table 2V (A1; A2; : : : ; An). %� = (J 1;J 2); U = <(Ai1 ; f1); (Ai2; f2); : : : ; (Aik ; fk)>OutputRefreshed table V , i.e., V tU� 2V .MethodDECLARE cursor box CURSOR FORSELECT A1; A2; : : : ; An FROM 2V ;24

Page 25: Incremental Maintenance of Aggregate and Outerjoin Expressions

OPEN cursor box;LOOPFETCH cursor box INTO :2a1; :2a2; : : : ; :2an;UNTIL not-foundDECLARE cursor view CURSOR FORSELECT A1; A2; : : : ; An FROM VWHERE (J 1(A1; A2; : : : ; An; :2a1; :2a2; : : : ; :2an) OR J 2(A1; : : : ; An; :2a1; : : : ; :2an))AND (NOT changed);OPEN cursor view;LOOPFETCH cursor view INTO :a1; :a2; : : : ; :an;UNTIL not-foundIF not-foundINSERT INTO V VALUES (:2a1; :2a2; : : : ; :2an);ELSE-IF J 2(:a1; :a2; : : : ; :an; :2a1; :2a2; : : : ; :2an)DELETE FROM V WHERE CURRENT OF cursor view;break;ELSEUPDATE V SET update attributes =function of update attributes from cursor box and cursor view;Mark the tuple at the CURRENT OF cursor view as changed.break;ENDIFENDLOOPCLOSE cursor view;ENDLOOPCLOSE cursor box; 3C Propagation of Deletions through Outerjoin OperatorsThe changes in an outerjoin V = R fo./J S due to deletions from a base relation R cannot besummarized in an outerjoin-change table within our restricted de�nition of refresh operator. Inthis section, we show that by using an extended version of the refresh operator, we can summarizeand propagate changes into an outerjoin view in response to deletions from a base table.Let Sset denote the expression �Attrs(S);Num=Count(�)(S) and 5R be the set of deletions fromR. We de�ne an OJdeletion-change table, 2V , as 5R lo./J Sset. The view V is refreshed usingthe refresh equation V tU� 2V , where � = (�Attrs(V );�Attrs(S)) and U = <(A1; f); : : : ; (An; f)>,where f(x; y) = NULL for all x; y and fA1; : : : ; Akg = Attrs(R). The refresh operator used in theabove refresh equation is called the OJdeletion-refresh operator and is de�ned using the refreshalgorithm (Algorithm 2). The refresh algorithm is used to refresh a view V using its OJdeletion-change table. The propagation equation (7) of of Table 3 for propagating an aggregate-change tablethrough an outerjoin-operator can be be extended by introducing counts, in the manner described25

Page 26: Incremental Maintenance of Aggregate and Outerjoin Expressions

above.Algorithm 2 Refresh Algorithm to apply an OJdeletion-change table to its viewInputView V (A1; : : : ; An; B1; : : : ; Bm)Change Table 2V (A1; : : : ; An; B1; : : : ; Bm; Num)� = (J 1;J 2); U = <(A1; f1); (A2; f2); : : : ; (Ak; fn)>OutputRefreshed table V , i.e., V tU� 2V .ProcedureBEGINfor each tuple 2v = (r; s; l) in 2VLet fv1; : : : ; vlg be the tuples in V that match2v due to the join condition J 1.if there is a tuple v0 2 V such that v0 =2 fv1; : : : ; vlg andv0 matches with 2v due to the join condition J 2thenDelete tuples v1; v2; : : : ; vl from V ;elseUpdate each tuple v1; v2; : : : ; vl in V using the speci�catio ns in U ;endifendforRETURN VEND 3Note that the OJ-deletion-refresh operator doesn't need to query the sources and hence, canbe executed very e�ciently. Change propagation equations required to propagate an OJdeletion-change table through relational and aggregate operators can be appropriately de�ned. We omitthe details here.26