
Forgetting data intelligently in data warehouses

Aliou Boly, Georges Hebrail
Laboratoire LTCI - UMR 5141 CNRS
Ecole Nationale Superieure des Telecommunications
Paris, France
boly@enst.fr, hebrail@enst.fr

Aliou Boly, Sabine Goutier
Research and Development
Electricite de France
Clamart, France
sabine.goutier@edf.fr

Abstract- The amount of data stored in data warehouses grows very quickly, so that they can get saturated. To overcome this problem, we propose a language for specifying forgetting functions on stored data. In order to preserve the possibility of performing interesting analyses of historical data, the specifications include the definition of some summaries of deleted data. These summaries are aggregates and samples of deleted data and will be kept in the data warehouse. Once forgetting functions have been specified, the data warehouse is automatically updated in order to follow the specifications. This paper presents the language for specifications, the structure of the summaries, and the algorithms to update the data warehouse.

Keywords- Data warehouses; forgetting functions

I. INTRODUCTION

Although the purpose of data warehouses is to store historical data, they can get saturated after a few years of operation. To overcome this problem, the solution is generally to archive older data when new data arrive, if the cost of maintaining older data cannot be justified economically. This solution is not satisfactory because analyses based on long-term historical data become impossible. As a matter of fact, analysis cannot be done on archived data without re-loading them into the data warehouse, and the cost of loading back a large dataset of archived data is too high to be borne for just one analysis. Archiving data makes them hard to manage and query efficiently, so archived data must be considered as lost data from a business intelligence perspective.

In this paper, we propose a solution to this problem: a language is defined to specify forgetting functions on older data. Two main ideas are developed: (1) the specifications define what data should be present in the data warehouse at each step of time, so that differences between the actual contents of the data warehouse and the specifications are candidate data to be archived; (2) some summaries of archived data are kept in the data warehouse, so that many analyses can still be done by using these summaries.

The summaries are of two types: aggregates and samples of older data. Aggregation and sampling are two standard and complementary ways to summarize data. Consider for example a data warehouse with click-stream data capturing user behaviour on web sites. As time passes, older detailed data tend to become useless and can be replaced in data mining algorithms by data aggregated by day or by month. In complement to aggregated data, some samples, either picked randomly or chosen to be of particular interest, can be used to perform analyses of detailed data or to explain the results of analyses. As for the summaries corresponding to aggregates of detailed data, they are defined to be more and more aggregated as they concern older data: for instance, sums of sales in a sales data warehouse can be specified to be aggregated at a daily level for data between one and three months old, then at a monthly level for data between three months and two years old, and finally at a yearly level for data older than two years. As for samples, their size is defined as a fixed size: consequently, if the sampling is performed correctly, each sample contains fewer and fewer detailed data as data get older.

The specification language for forgetting functions is defined in the context of relational databases: one specification can be defined for each table in the database. Once specifications have been defined, algorithms presented in this paper automatically update the data warehouse according to the specifications and fill up the corresponding summaries. Note that these algorithms can be run at any time by the database administrator. Our motivation in this work is to deal with the problem of data warehouse saturation, but all the described results are applicable to relational databases supporting OLTP systems. All these algorithms have been studied and programmed in a prototype on top of the ORACLE system.

The paper is organized as follows. Section II is devoted to the presentation of related work. In Section III, we present the specification language we have defined, after a formal definition of the 'age' of data in a data warehouse and the presentation of a motivating example. In Section IV, we show that the aggregate summaries we define can be stored in data cubes and present the algorithms for applying forgetting functions. Section V treats the conservation of samples. Section VI concludes this work and draws some perspectives.

II. RELATED WORK

Our work is first related to 'vacuuming', which is an approach to physical deletion [22]. The concept of vacuuming is developed by Jensen in [17] in the context of transaction-time databases: data that is older than a certain time should be considered as inaccessible and can be removed.



In the TSQL2 temporal query language [10, 24], when a particular date is specified, only data that is prior to the date should be physically deleted [17].

Our work is also related to work on data reduction as described in [23], where a technique is presented for data reduction that handles a gradual change of data from new detailed data to older and summarized data in a data warehouse. Our work offers comparable facilities but is applicable to relational databases instead of multidimensional databases. Moreover, there is no conservation of detailed data in their work, compared to ours.

More recently, much work has been done around the concept of data streams, where there is a strong need for summarizing data on a temporal perspective. For example, in [4, 8, 25, 32, 33] the problem of computing temporal aggregates over data streams is examined, and it is suggested to maintain aggregates at multiple levels of granularity depending on time: older data are aggregated using coarser granularities while more recent data are aggregated with finer detail. In our work, we also develop a model for such a feature, but the way data are aggregated is specified by the data warehouse administrator instead of being controlled by the stream.

The specifications we define for forgetting data control how older data should expire and thus be deleted from the data warehouse. In the context of materialized views (see [9, 15, 16, 19, 26]), work has been done on data expiration: in [13] it is suggested to expire (delete) unneeded materialized view tuples, so that a set of predefined views on these materialized views can still be maintained with future updates. The motivation in this work is that data warehouses collect data into materialized views for analysis and, as time evolves, the materialized views occupy too much space and some of the data may no longer be of interest. This approach cannot be applied to solve the problem we address in this paper.

Still related to data expiration is the work described in [28]: the problem addressed there is data expiration in the context of historical databases, which store the history of the different states of the database. This work is not relevant here, since we are not interested in the history of database states but in historical data explicitly stored in a data warehouse.

When applying forgetting functions, some tuples will be deleted from the data warehouse, possibly linked by some referential constraints. This is related to the concept of a garbage collector [5, 21], which is a form of automatic memory management. The principle is to determine what data objects in a program will not be accessed in the future and to reclaim the storage used by those objects. In particular, a garbage collector must analyze relationships (links) between objects to be deleted: we show in Section V that a similar problem appears when tuples from different tables are linked by foreign key constraints (see [1, 12, 14, 29]).

Compared with the related work cited above, our main contribution is the definition of a language for specifying such forgetting functions and keeping summaries of archived data. These specifications, defined by the database administrator, enable automating the process of applying the forgetting functions, which includes the deletion of data to be archived (possibly linked with foreign key constraints) and the update of summaries, which are aggregates and samples of detailed data.

In [3, 6, 7, 9, 11, 18], it is shown that many queries can be answered using only samples or aggregated data instead of the whole database. On the one hand, when aggregate summaries are available instead of the detailed version of data, it is still possible to answer aggregation queries with a result provided as an approximation of the exact answer. On the other hand, samples can be used to infer characteristics of a whole dataset by using sampling theory as in surveys (see [2, 27]). Moreover, many data mining algorithms, such as decision trees for instance, can take as input only aggregates over detailed data. This justifies our approach of keeping summaries of archived data; a complete discussion of this aspect is beyond the scope of this paper.

III. SPECIFICATION OF FORGETTING FUNCTIONS

The specifications we define are based on the notion of 'age' of data stored in a data warehouse. This notion is defined formally in the next sub-section.

A. Notion of age of data

As mentioned in the introduction, our study is done in the context of relational databases. We consider that detailed data are tuples stored in relations of the data warehouse. We assume here that each tuple is associated with a date, denoted ts, which corresponds either to the value of an attribute of the relation or to a system timestamp representing the date of the last update of the tuple. The age of a tuple calculated at current date tc is defined to be the difference tc - ts. Both tc and ts can be expressed at different time unit levels, defined to be: SECOND, MINUTE, HOUR, DAY, MONTH, QUARTER and YEAR.

Durations like the age tc - ts can also be defined in different units; by convention we take:
1 MINUTE = 60 s [seconds],
1 HOUR = 60*60 s,
1 DAY = 24*3600 s,
1 MONTH = 30 DAYS = 30*24*3600 s,
1 QUARTER = 3*30 DAYS = 3*30*24*3600 s,
1 YEAR = 365 DAYS = 365*24*3600 s.

In order to compute the age tc - ts, we first transform tc and ts into the SECOND time level, whatever the time unit level used for each of them. This is done by taking the first instant of the period covered by the date if it is not expressed at the SECOND level. For instance, to_second('17/02/06 12:05') = '17/02/06 12:05:00' and to_second('February 2006') = '01/02/06 00:00:00'.



The age is first computed at the SECOND time unit as follows:
- if ts is expressed at the SECOND level: age = to_second(tc) - ts;
- if ts is not expressed at the SECOND level: age = to_second(tc) - to_second(ts + 1).

Secondly, the age can be converted to a higher level using the conventions defined above and the floor function:
in minutes: floor(age / 60),
in hours: floor(age / 3600),
in days: floor(age / (24*3600)),
in months: floor(age / (30*24*3600)),
in quarters: floor(age / (3*30*24*3600)),
in years: floor(age / (365*24*3600)).

This process ensures that the duration between the last instant of the period covered by ts (for instance, the period is one day if it is expressed at the DAY time level) and the first instant of the period covered by tc is at least the age expressed in its unit (for instance 2 months if the age has been converted to the MONTH unit).

Some examples are given:
tc = '20/01/06 12:34:45', ts = '20/01/06 10:50:54': tc - ts = 6231 SECOND = 1 HOUR.
tc = '20/02/06', ts = '20/01/06 11h': to_second(tc) = '20/02/06 00:00:00', to_second(ts + 1) = '20/01/06 12:00:00', tc - ts = 2635200 SECOND = 30 DAY = 1 MONTH.

Finally, an ordering can be defined between ages: we say age1 < age2 if in_second(age1) < in_second(age2), where in_second is a function which transforms into seconds a duration expressed in other duration time units, using the conversion conventions described above. For example: 30 DAY < 3 MONTH < 1 YEAR.
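For concreteness, the age computation and unit conversions above can be sketched as follows in Python. This is an illustration of ours, not code from the paper's prototype (names such as age_seconds, convert and SECONDS_PER are assumptions), checked against the two examples just given.

```python
from datetime import datetime
from math import floor
from typing import Optional

# Duration conventions of Section III.A, expressed in seconds.
SECONDS_PER = {
    "MINUTE": 60,
    "HOUR": 3600,
    "DAY": 24 * 3600,
    "MONTH": 30 * 24 * 3600,
    "QUARTER": 3 * 30 * 24 * 3600,
    "YEAR": 365 * 24 * 3600,
}

def age_seconds(t_c: datetime, t_s: datetime,
                t_s_next: Optional[datetime] = None) -> int:
    """Age at current date t_c. If t_s is expressed at the SECOND level,
    leave t_s_next unset: age = to_second(t_c) - t_s. Otherwise pass in
    t_s_next the first instant of the period following t_s, so that
    age = to_second(t_c) - to_second(t_s + 1)."""
    origin = t_s if t_s_next is None else t_s_next
    return int((t_c - origin).total_seconds())

def convert(age: int, unit: str) -> int:
    """Convert an age in seconds to a coarser unit with the floor function."""
    return floor(age / SECONDS_PER[unit])

# First example: tc - ts = 6231 SECOND = 1 HOUR.
age = age_seconds(datetime(2006, 1, 20, 12, 34, 45),
                  datetime(2006, 1, 20, 10, 50, 54))
assert age == 6231 and convert(age, "HOUR") == 1

# Second example: ts = '20/01/06 11h' is at the HOUR level, so the next
# period starts at 12:00:00; age = 2635200 SECOND = 30 DAY = 1 MONTH.
age = age_seconds(datetime(2006, 2, 20), datetime(2006, 1, 20, 11),
                  t_s_next=datetime(2006, 1, 20, 12))
assert age == 2635200 and convert(age, "DAY") == 30 and convert(age, "MONTH") == 1
```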

B. Motivating example

We present here a motivating example designed from a real CRM (Customer Relationship Management) data warehouse at Electricite de France (EDF). Let CLIENT and ORDER be the following relations (tables) of the data warehouse:

CLIENT(clientID, City, Department, Region, Sex, Salary),
ORDER(orderID, Date_order, amount, clientID)

The first (underlined) attribute of each relation is its primary key. The CLIENT relation contains the location, sex and salary of each client. The ORDER relation contains the date, amount and client of each order. We assume that the following referential integrity constraint holds: from ORDER.clientID to CLIENT.clientID. Figure 1 presents a specification of a forgetting function for the ORDER relation.

    SUMMARY TABLE Order {
        USE CLIENT JOIN BY clientID;
        TIMESTAMP = Date_order;
        HIERARCHY(Geography): City - Department - Region;
        LESS THAN 1 MONTH: DETAIL;
        LESS THAN 3 MONTH: SUM(amount) BY CLIENT.City, CLIENT.Sex, DAY;
        LESS THAN 1 YEAR: SUM(amount) BY CLIENT.Department, MONTH;
        LESS THAN 5 YEAR: SUM(amount) BY CLIENT.Region, YEAR;
        LESS THAN 10 YEAR: SUM(amount) BY YEAR;
        KEEP SAMPLE (1000) WHERE amount > 4000;
    END SUMMARY;

Figure 1. Motivating example

This forgetting function shows first that the date used to compute the age of each Order tuple is the value provided by the Date_order column (TIMESTAMP = Date_order). The 'LESS THAN' specifications define when detailed data (tuples from the ORDER relation) should be archived and which aggregates should be kept depending on the age of archived data. Tuples having an age less than 1 MONTH must be kept in the ORDER relation. Tuples older than 1 MONTH can be archived, but some aggregates must be kept by the system:
- Tuples with an age between 1 and 3 months must be at least described with aggregates by City, Sex and DAY. Notice that City and Sex are attributes of the CLIENT relation; aggregation by City and Sex is possible through the specification "USE CLIENT JOIN BY clientID".
- Tuples with an age between 3 months and 1 year must be at least described with aggregates by Department and MONTH. Aggregation by Department is possible by using the 'Geography' hierarchy¹.
- Tuples with an age between 1 year and 5 years must be at least described with aggregates by Region and YEAR.
- Tuples with an age between 5 years and 10 years must be at least described with aggregates by YEAR.
- Tuples with an age greater than 10 years can be completely forgotten.

Some semantic constraints in the language ensure that the different 'LESS THAN' specifications are consistent: 'LESS THAN' specifications with older age are associated with aggregates which can be obtained by a roll-up operation from 'LESS THAN' specifications with younger age. Finally, the "KEEP SAMPLE (1000) WHERE amount>4000" specification keeps a random sample of archived tuples: a random sample of 1000 tuples is maintained from the archived ORDER tuples which satisfy the condition "amount>4000".

C. Specification language for forgetting functions

A language has been defined to specify a forgetting function associated with each table in a relational database.

¹ We suppose that there exists a hierarchy named Geography between attributes City, Department and Region of the CLIENT relation.
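Read operationally, the 'LESS THAN' clauses of Figure 1 form an ordered list of age thresholds. A minimal sketch of this policy as plain data, with hypothetical names of our choosing (POLICY, level_for_age), could look like this:

```python
# Conventions of Section III.A, in seconds.
MONTH, YEAR = 30 * 24 * 3600, 365 * 24 * 3600

# Each entry: (upper age bound, aggregation levels), youngest first.
POLICY = [
    (1 * MONTH, "DETAIL"),
    (3 * MONTH, ("CLIENT.City", "CLIENT.Sex", "DAY")),
    (1 * YEAR, ("CLIENT.Department", "MONTH")),
    (5 * YEAR, ("CLIENT.Region", "YEAR")),
    (10 * YEAR, ("YEAR",)),
]

def level_for_age(age_seconds: int):
    """Return the level an ORDER tuple of this age falls under, or None
    when the tuple is older than 10 years and can be forgotten."""
    for bound, levels in POLICY:
        if age_seconds < bound:
            return levels
    return None
```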



The grammar of this language is given in the appendix. It includes additional features, such as the discretization of numerical attributes to use them as aggregation attributes, which are not described in this paper.

The focus in this paper is on the management of forgetting functions. Once some specifications have been defined, the application of forgetting functions is automatic: some algorithms and storage structures are needed to manage the forgetting process. The two next sections respectively study the maintenance of aggregates and the maintenance of detailed data (archiving of detailed data and update of samples).

IV. MANAGEMENT OF FORGETTING FUNCTIONS BY AGGREGATION

A. Storage structure for aggregate summaries

As shown in the example of Section III.B, a forgetting function contains several 'LESS THAN' aggregation specifications. Each of these specifications indicates an aggregation level depending on the age of data. From these 'LESS THAN' specifications, a cube scheme is derived and will be used to store aggregates as time passes.

A cube scheme C is a list (D1, ..., Dn, TL, M), where D1, ..., Dn are dimension names, TL is the time dimension and M is a vector of measures. With each dimension Di is associated an ordered list of levels Li = levels(Di), including the value ALL aggregating completely the dimension². These levels among dimensions represent hierarchies: we assume there is only one hierarchy per dimension. For our example, we have:

C_ORDER = (LOCATION, SEX, TL, TOTAL_AMOUNT),
levels(LOCATION) = (CITY, DEPARTMENT, REGION, ALL), levels(SEX) = (SEX, ALL), levels(TL) = (DAY, MONTH, YEAR, ALL).

Then, using the cube scheme, we associate one cube of the scheme (also called a cuboid in some references) with each 'LESS THAN' aggregation specification. A cube C of a cube scheme C is defined by (l1, ..., ln, tl, M) such that li ∈ levels(Di), i = 1..n, and tl ∈ TL. With each level l is associated a non-empty set of values dom(l), representing the domain of l; with each time level tl is associated dom(tl), representing the domain of tl. A cell c of a cube C is a tuple c = (x1, ..., xn, t, m), where ∀i = 1..n, xi ∈ dom(li), t ∈ dom(tl) and m ∈ M.

This gives the following cubes for our example:

S1: LESS THAN 3 MONTH: SUM(amount) BY City, Sex, DAY;
→ C1 = (CITY, SEX, DAY, SUM_AMOUNT)
S2: LESS THAN 1 YEAR: SUM(amount) BY Department, MONTH;
→ C2 = (DEPARTMENT, ALL, MONTH, SUM_AMOUNT)
S3: LESS THAN 5 YEAR: SUM(amount) BY Region, YEAR;
→ C3 = (REGION, ALL, YEAR, SUM_AMOUNT)
S4: LESS THAN 10 YEAR: SUM(amount) BY YEAR;
→ C4 = (ALL, ALL, YEAR, SUM_AMOUNT)

The different 'LESS THAN' specifications can be ordered by age³ (e.g., the age of 'LESS THAN 3 MONTH' is 3 MONTH): age(S1) < age(S2) < age(S3) < age(S4). The cubes can also be ordered from the least to the most aggregated one (we use the same < notation): C1 < C2 < C3 < C4. Note that C1 < C2 means that the partition of detailed tuples induced by cube C1 is finer than the one induced by C2⁴.

² For the time dimension, we generally define levels(TL) as (SECOND, MINUTE, HOUR, DAY, MONTH, QUARTER, YEAR, ALL).
³ The age of a 'LESS THAN' specification is defined by the age value specified just after the 'THAN' keyword.
⁴ The < relation is a total order because there is only one hierarchy per dimension.
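As an illustration of this storage structure, the four cubes could be represented in memory as follows. This is a sketch under our own naming assumptions (Cube, levels, cells, inactive), not the ORACLE-based structure of the prototype:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Cube:
    """One cube of the scheme: its aggregation levels, its active cells
    (position -> measures) and the positions of inactivated cells."""
    levels: Tuple[str, ...]                       # e.g. ('CITY', 'SEX', 'DAY')
    cells: Dict[tuple, Dict[str, float]] = field(default_factory=dict)
    inactive: list = field(default_factory=list)

# The four cubes derived from Figure 1, ordered C1 < C2 < C3 < C4.
C1 = Cube(("CITY", "SEX", "DAY"))
C2 = Cube(("DEPARTMENT", "ALL", "MONTH"))
C3 = Cube(("REGION", "ALL", "YEAR"))
C4 = Cube(("ALL", "ALL", "YEAR"))

# A cell of C1: a position (dimension values plus time value) and a measure.
C1.cells[("Paris", "F", "2006-01-01")] = {"SUM_amount": 2000.0}
```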

B. Update of aggregate summaries

The algorithms developed for updating aggregate summaries are based on three properties:
1) additivity of aggregation functions,
2) disjointness of cubes,
3) commutation of updates of aggregates and detailed data.

1) Additivity of aggregation functions: This property is often assumed when building cubes. In our context, it ensures that data belonging to a cube C2 can be computed from a cube C1 if C1 < C2. In our example, the SUM aggregation function on the measure amount is additive. We assume that the specified aggregation functions are either additive or derivable from additive ones (for instance, AVG can be derived from SUM and COUNT). Note that since AVG is computed in terms of SUM and COUNT, if a specification includes an AVG function then we assume it also includes the corresponding SUM and COUNT functions. We do not further consider AVG separately in this paper.

2) Disjointness of data cubes: We have seen in the previous section that 'LESS THAN' specifications can be ordered by their age. Let us consider two specifications Si-1 and Si associated with cubes Ci-1 and Ci respectively. We have age(Si-1) < age(Si). A tuple d from the base relation (ORDER in our example) is assumed to be counted in only one cube, defined by: d is counted in Ci iff age(Si-1) ≤ age(d) < age(Si). The consequence of this assumption is that all cubes of a forgetting function are disjoint: no cell of a cube is included (functionally) in a cell of a more aggregated cube. The different cubes of a forgetting function store disjoint periods over time.

3) Commutation of updates of aggregates and detailed data: The consequence of the preceding assumption and property is that updates issued from the application of forgetting functions are of only two types:
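A small sketch of the additivity property, with a helper name of our own (merge): two cells that roll up to the same coarser cell are combined measure by measure, and AVG is derived from the stored SUM and COUNT rather than stored itself:

```python
def merge(target: dict, source: dict) -> None:
    """Combine additive measures of two cells that roll up together."""
    for m, v in source.items():
        if m.startswith(("SUM", "COUNT")):
            target[m] = target.get(m, 0) + v
        elif m.startswith("MIN"):
            target[m] = min(target.get(m, v), v)
        elif m.startswith("MAX"):
            target[m] = max(target.get(m, v), v)

cell = {"SUM_amount": 1200.0, "COUNT": 3}
merge(cell, {"SUM_amount": 800.0, "COUNT": 1})
avg_amount = cell["SUM_amount"] / cell["COUNT"]  # AVG derived, not stored
```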



- detailed tuples from the base relation are archived and must be counted in one (unique) cube;
- already aggregated data may be transferred from one cube to a more aggregated one.

Since detailed data are only counted in one cube, it is equivalent to first update the cubes with newly archived data and then operate the transfers between cubes, or to first operate the transfers and then count the newly archived data.

We assume here that updates due to forgetting functions can be applied at any time, either evenly (for instance every day), irregularly, or even sporadically. Each time the update program is launched, new tuples of the base relation may satisfy some conditions of aggregation specifications: they must be archived and included in the corresponding cubes (which depend on when the update program is invoked). Also, aggregated data of some cubes may be transferred to another cube if they satisfy the criteria of the corresponding specification. Our algorithm operates within a bounded total space necessary to store the different cubes. This is achieved by the disjointness property and by an activation/inactivation process for the cube cells: when data are transferred out of a data cube, the cells where they were stored are not deleted but inactivated; and when data should be transferred into a data cube, we first check whether they may occupy some inactivated cells of this cube, and if so, these cells are activated.

As a consequence of the commutation property, we distinguish two update procedures which can be invoked in any order:
(1) the procedure which transfers cells between cubes (Figure 2), and
(2) the procedure which includes newly archived data in the cubes (Figure 3).

In the presented algorithms, we consider that each cell in a cube is characterized by two fields: the position and the measure. The cell's position is the set of values of the corresponding dimension levels that determine this cell. For example, the position of cell ('Paris', 'F', 'Jan 01, 06', 2000) is ('Paris', 'F', 'Jan 01, 06') and the corresponding measure is equal to 2000.

    for each cube Ci (i = 1 .. n-1)
      for each cube Cj (j = i+1 .. n)
        /* Si and Sj are the specifications associated with cubes Ci and Cj;
           the search is accelerated by an index on the time dimension */
        for each active cell c in Ci with age(Sj-1) <= age(c) < age(Sj)
          let c' = the cell in cube Cj covering c
          if (c' is found)
            /* incrementation */
            for each aggregate measure M in c'
              if M is COUNT or SUM: c'.M = c'.M + c.M
              else if M is MIN: c'.M = MIN(c'.M, c.M)
              else if M is MAX: c'.M = MAX(c'.M, c.M)
          else
            check if there is an inactivated cell c'' in Cj that can replace c'
            if (found)
              c''.position = c.position; c''.measure = c.measure (activate c'')
            else
              create cell c' in Cj from c
          inactivate cell c in the cube Ci

Figure 2. Algorithm for the transfer of data between cubes

In the algorithm presented in Figure 2, it is important to note that cells of a cube Ci may not satisfy the criteria of the following cube Ci+1 but those of a cube Cp such that p > i+1. For example, this may happen if the forgetting function has not been applied for a long period. Considering the example presented in Section III.B, when the forgetting function has not been applied for five years, the data cells of the cube corresponding to the second specification⁵ should be directly transferred to the most aggregated cube. This explains why there are two nested loops over cubes in the algorithm.

⁵ LESS THAN 3 MONTH: SUM(amount) BY City, Sex, DAY;

We now present the algorithm for the inclusion of tuples from the base relation into the cubes (Figure 3). Notice that the order of processing of the data cubes is not important, each detailed tuple being counted in only one cube.

    for each cube Ci = (l1, ..., ln, tl, M)
      /* S0 corresponds to the specification DETAIL;
         we suppose there is an index on the time dimension */
      let R' = the tuples d of R (base relation) with age(Si-1) <= age(d) < age(Si),
               aggregated at the levels of Ci
      for each cell c' in R'
        let c = the cell of Ci having the same values for the dimensions as c'
        if (c is found)
          /* incrementation */
          for each aggregate measure M in c
            if M is COUNT or SUM: c.M = c.M + c'.M
            else if M is MIN: c.M = MIN(c.M, c'.M)
            else if M is MAX: c.M = MAX(c.M, c'.M)
        else
          check if there is an inactivated cell c'' in Ci that can replace c
          if (found)
            c''.position = c'.position; c''.measure = c'.measure (activate c'')
          else
            create cell c' in Ci

Figure 3. Algorithm for the inclusion of tuples from the base relation into the cubes
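As an illustration, the transfer procedure of Figure 2 can be rendered on a simplified in-memory representation. This is a sketch of ours, not the paper's prototype: each cube is a plain dict from positions to measures, and the roll_up and age_of helpers are assumed; cell inactivation and reuse, the time index, and the final forgetting of data older than the last specification are elided.

```python
def transfer(cubes, spec_ages, roll_up, age_of):
    """cubes: list of dicts {position: measures}, ordered C1 < ... < Cn;
    spec_ages: [age(S1), ..., age(Sn)] in seconds, strictly increasing;
    roll_up(pos, i, j): position of the Cj cell covering a Ci cell;
    age_of(pos): age in seconds of a cell position at the current date."""
    n = len(cubes)
    # Two nested loops: a cell of Ci may jump directly to a cube Cp with
    # p > i+1 when the forgetting function has not been run for a long time.
    for i in range(n - 1):
        for j in range(i + 1, n):
            movable = [p for p in list(cubes[i])
                       if spec_ages[j - 1] <= age_of(p) < spec_ages[j]]
            for pos in movable:
                measures = cubes[i].pop(pos)          # drop the Ci cell
                target = cubes[j].setdefault(roll_up(pos, i, j), {})
                for m, v in measures.items():         # incrementation
                    if m.startswith(("SUM", "COUNT")):
                        target[m] = target.get(m, 0) + v
                    elif m.startswith("MIN"):
                        target[m] = min(target.get(m, v), v)
                    elif m.startswith("MAX"):
                        target[m] = max(target.get(m, v), v)
```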



C. Management of referential constraints

In the preceding sections, we have assumed that only one base relation is associated with a forgetting specification. Our approach enables the definition of several specifications, each of them being associated with one relation. The problem which arises when defining several forgetting specifications is that tuples from base relations may be linked together by foreign keys. For instance, consider in our example a relation BILL(billID, Date_bill, amount, orderID) with the following referential integrity constraint: from BILL.orderID to ORDER.orderID. Let us assume that the timestamp used by the forgetting function associated with the BILL relation to calculate the age of each Bill tuple is provided by the Date_bill column, and that the following specifications are defined:

LESS THAN 4 MONTH: DETAIL;
LESS THAN 6 MONTH: SUM(amount) BY MONTH;

The problem appears when a tuple t1 to be archived in a base relation (ORDER in our example) is referenced by a tuple t2 in another relation (BILL in our example) which is not yet archived. In our example, the Order tuples can be selected to be archived while the corresponding Bill tuples are not yet archived. This may happen when forgetting functions are applied every month (for instance), since Bill tuples having an age less than 4 MONTH must be kept in the BILL relation while Order tuples older than 1 MONTH are specified to be archived. We propose the following solution in this case: t1 is not archived but marked to be archived, and will be checked for archiving in later executions of the forgetting function. So, the Order tuples in our example are marked to be archived, and they will be archived when the corresponding Bill tuples are archived. Note that no further update of a marked tuple is allowed.
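A minimal sketch of this mark-to-archive rule (the function and variable names are ours): Order tuples still referenced by unarchived Bill tuples are deferred and re-examined at the next run:

```python
def split_archivable(orders_due, bills_live):
    """orders_due: orderIDs selected for archiving by the forgetting function;
    bills_live: orderIDs referenced by Bill tuples not yet archived."""
    referenced = set(bills_live)
    archive_now = [o for o in orders_due if o not in referenced]
    marked = [o for o in orders_due if o in referenced]  # re-checked next run
    return archive_now, marked

# Example: order 2 is still referenced by a live bill, so it is only marked.
assert split_archivable([1, 2, 3], [2]) == ([1, 3], [2])
```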

V. CONSERVATION OF SAMPLED DETAILED DATA

As seen in the example of Section III.B, the specification language makes it possible to keep samples of archived data (thus deleted from the base relations). This is the KEEP SAMPLE specification, which indicates that a random sample of fixed size is maintained whenever detailed tuples are archived. Such samples are stored in a separate table which has the same structure as the base relation, with some additional attributes for the management of forgetting functions.

At every application of the forgetting functions, new tuples may be archived and should be considered to update the sample. For example, suppose that a sample of 1000 individuals is kept among the tuples to archive and that we have 100 new tuples to archive since the last application of the forgetting functions. For maintaining the sample, it is necessary to use an incremental technique: we use the reservoir sampling algorithm due to Waterman (see Vitter [30]). This approach enables sampling data 'on the fly', without knowing in advance the number of individuals in the whole population. It operates by decreasing over time the probability of picking up a tuple. Notice that the size of the samples is bounded by the definition of the KEEP SAMPLE specifications.

As indicated at the end of Section II, random samples from a population can be used to infer information about the whole population by only observing values available in a sample of it (see [2, 27]). For instance, it is possible to infer the sum of the amounts of orders by measuring it on a sample of orders and then calibrating this measure. Such estimations (and the corresponding confidence intervals) can only be computed if the size of the whole population is known. So it is necessary to store the number of archived tuples corresponding to each sample. This is achieved by always defining a COUNT measure in the storage structure of aggregate summaries (or defining one if no aggregate summary has been specified).
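For illustration, reservoir sampling and the calibrated estimation of a total can be sketched as follows. This is an illustration of ours, not the prototype's code; the Reservoir class name is an assumption:

```python
import random

class Reservoir:
    """Fixed-size uniform sample over a stream of archived tuples
    (Algorithm R; see Vitter [30])."""
    def __init__(self, size):
        self.size, self.seen, self.sample = size, 0, []

    def add(self, tuple_):
        self.seen += 1                       # total number of archived tuples
        if len(self.sample) < self.size:
            self.sample.append(tuple_)
        else:
            j = random.randrange(self.seen)  # keep with probability size/seen
            if j < self.size:
                self.sample[j] = tuple_

# Estimate the sum of order amounts from the sample; this needs the
# archived-population COUNT kept with the summaries (here, r.seen).
r = Reservoir(1000)
for amount in range(100_000):
    r.add(amount)
estimated_total = r.seen * sum(r.sample) / len(r.sample)
```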

VI. CONCLUSION AND PERSPECTIVES

We have proposed a solution for dealing with the saturation problem in data warehouses and, more generally, in relational databases. A language has been defined for specifying a policy on how to archive data in data warehouses and keep summaries of archived data. These summaries are of two types, aggregates and samples, and they are designed to occupy a bounded amount of space in the database. Once forgetting specifications have been defined, the algorithms presented in the paper automatically update both the data warehouse and the summaries. A prototype system, based on the ORACLE database software, has been developed to prove the concept. Aggregates and samples are known to contain enough information to perform many data mining analyses and to answer queries in an approximate way. Our current work focuses on the exploitation of summaries combining aggregates and samples.

APPENDIX

The keywords are in upper case; constructions with '[' and ']' are optional. The symbol '|' denotes an alternative form. The elements whose names begin with id denote an alphanumeric sequence of characters.

<forgetting function> ::=
    SUMMARY TABLE <table_name> {
        <list_use_linked_table>
        <specif timestamp>;
        <specif discretise>;
        <specif hierarchy>;
        <specif detail>;
        <list_specif aggregation>;
        <specif KEEP>;
    END SUMMARY;



<table_name> ::= idAtt

<list_use_linked_table> ::= <use_linked_table>
    | <list_use_linked_table>';'<use_linked_table>

<use_linked_table> ::= /* empty */
    | USE <table_name> JOIN BY <columns>

<specif timestamp> ::= TIMESTAMP = SYSTEM | <column_name>

<columns> ::= <column_name> | <columns>','<column_name>

<column_name> ::= idAtt

<specif discretise> ::= /* empty */
    | <discretisation>
    | <specif discretise>';'<discretisation>

<discretisation> ::= DISCRETISE'('<column_name>')' = idAtt ':' '(' (<interval> idAtt';')+ ')'

<interval> ::= ('['|']')<numeric value>','<numeric value>('['|']')

<numeric value> ::= ['+'|'-']<number>

<number> ::= <chiffre> | <number><chiffre>

<chiffre> ::= 0|1|2|3|4|5|6|7|8|9

<specif hierarchy> ::= /* empty */
    | <hierarchy>
    | <specif hierarchy>';'<hierarchy>

<hierarchy> ::= HIERARCHY '('<dimension name>')' ':' '('<levels hierarchy>')'

<dimension name> ::= idAtt

<levels hierarchy> ::= <level hierarchy>
    | <levels hierarchy> '-' <level hierarchy>

<level hierarchy> ::= idAtt

<specif detail> ::= /* empty */ | <criteria>':'DETAIL

<list_specif aggregation> ::= <specif aggregation>
    | <list_specif aggregation>';'<specif aggregation>

<specif aggregation> ::= <criteria>':'<aggregation data>

<criteria> ::= LESS THAN <value> <time level>

<value> ::= <chiffre> | <value><chiffre>

<time level> ::= SECOND | MINUTE | HOUR | DAY | MONTH | QUARTER | YEAR

<aggregation data> ::= <measures> <list BY>

<measures> ::= <aggregation> | <measures>','<aggregation>

<aggregation> ::= COUNT'('*')' | <AGG>'('<column_name>')'

<AGG> ::= COUNT | SUM | MIN | MAX | AVG

<list BY> ::= /* empty */ | BY <list levels>

<list levels> ::= <level> | <list levels>','<level>

<level> ::= [<table_name>'.']<column_name> | <name_discretise> | <time level>

<specif KEEP> ::= /* empty */
    | KEEP SAMPLE '('<value>')' [WHERE <condition>]

<condition> ::= <expression> | <condition> <op logic> <condition>

<expression> ::= [<table_name>'.']<column_name> <op> <simple expression>

<op> ::= '<' | '>' | '>=' | '<=' | '=' | '<>'

<simple expression> ::= idAtt | <numeric value>

<op logic> ::= OR | AND

REFERENCES

[1] Akoka J., Comyn-Wattiau I., Conception des bases de donnees relationnelles en pratique, Vuibert, France, 2003.
[2] Ardilly P., Les Techniques de sondage, Technip, France, 1994.
[3] Babcock B., Chaudhuri S., Das G., Dynamic Sample Selection for Approximate Query Processing, SIGMOD 2003, June 9-12, 2003, San Diego, CA.
[4] Babcock B., Datar M., Motwani R., Widom J., Models and issues in data stream systems. In Proceedings of the 2002 ACM Symposium on Principles of Database Systems (PODS'02), pages 1-16, Madison, WI, June 2002.
[5] Boehm H., Bounding Space Usage of Conservative Garbage Collectors, Proceedings of the 2002 ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 93-100, January 2002.
[6] Chaudhuri S., Das G., Srivastava U., Effective Use of Block-Level Sampling in Statistics Estimation. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 287-298, Paris, France, June 13-18, 2004.
[7] Chaudhuri S., Das G., Motwani R., Narasayya V., A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 295-306, Santa Barbara, California, United States, 2001.



[8] Chen Y., Dong G., Han J., Wah B. W., Wang J., Multi-dimensional regression analysis of time-series data streams. In Proceedings of the 2002 International Conference on Very Large Data Bases (VLDB'02), pages 323-334, Hong Kong, China, August 2002.
[9] Chirkova R., Li C., Materializing Views with Minimal Size to Answer Queries, Principles of Database Systems (PODS) 2003, June 9-12, 2003, San Diego, CA.
[10] Dumas M., Fauvet M.-C., Scholl P.-C., Chapter V, "Temporal Models". In "Bases de Donnees et Internet", G. Jomier, A. Doucet (eds.), Hermes Science Publications, 2002, ISBN 2-7462-0283-2.
[11] Ganti V., Lee M. L., Ramakrishnan R., ICICLES: Self-tuning Samples for Approximate Query Answering. Proceedings of VLDB, 2000.
[12] Garcia-Molina H., Ullman J., Widom J., Database Systems: The Complete Book, Prentice Hall, first edition, 2002.
[13] Garcia-Molina H., Labio W. J., Yang J., Expiring data in a warehouse, Proceedings of the 24th VLDB Conference, pages 500-511, New York, USA, 1998.
[14] Gardarin G., Bases de Donnees objet et relationnel, Eyrolles, 1999.
[15] Gupta H., Mumick I., Selection of Views to Materialize in a Data Warehouse. IEEE Transactions on Knowledge and Data Engineering (TKDE), volume 17(1), pages 24-43, 2005.
[16] Gupta A., Mumick I. S., Maintenance of Materialized Views: Problems, Techniques, and Applications, IEEE Data Engineering Bulletin, 18(2), pages 3-18, 1995.
[17] Jensen C. S., "Vacuuming", in The TSQL2 Temporal Query Language, R. T. Snodgrass, editor, Chapter 23, pages 451-462, Kluwer Academic Publishers, 1995.
[18] Jin R., Glimcher L., Jermaine C., Agrawal G., New Sampling-Based Estimators for OLAP Queries, Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), April 3-8, 2006, Atlanta, GA, USA.
[19] Paraboschi S., Sindoni G., Baralis E., Teniente E., Materialized Views in Multidimensional Databases. In Maurizio Rafanelli (ed.), Multidimensional Databases: Problems and Solutions, pages 222-251, Idea Group, 2003.
[20] Ramakrishnan R., Gehrke J., Database Management Systems, McGraw-Hill, third edition, 2003.
[21] Serrano M., Boehm H., "Understanding Memory Allocation of Scheme Programs", Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, pages 245-256, Montreal, Canada, 2000.
[22] Skyt J., Jensen C. S., Mark L., A Foundation for Vacuuming Temporal Databases, Data and Knowledge Engineering, volume 44, issue 1, January 2003.
[23] Skyt J., Jensen C. S., Pedersen T. B., Specification-Based Data Reduction in Dimensional Data Warehouses, TimeCenter Technical Report TR-61, 2001.
[24] Snodgrass R. T., The TSQL2 Temporal Query Language, Kluwer Academic Publishers, 1995.
[25] Tatbul N., Zdonik S., Window-Aware Load Shedding for Aggregation Queries over Data Streams, Proceedings of the 32nd International Conference on Very Large Data Bases, September 12-15, 2006, Seoul, Korea.
[26] Theodoratos D., Xu W., Constructing search spaces for materialized view selection. In Proceedings of the ACM Seventh International Workshop on Data Warehousing and OLAP (DOLAP), pages 112-121, 2004.
[27] Tille Y., Theorie des sondages, Dunod, France, 2001.
[28] Toman D., Expiration of Historical Databases. Proceedings of the Eighth International Symposium on Temporal Representation and Reasoning (TIME'01), pages 128-135, IEEE Press, 2001.
[29] Ullman J. D., Principles of Database and Knowledge-Base Systems, volumes 1 and 2, Computer Science Press, 1989.
[30] Vitter J. S., Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1), pages 37-57, March 1985.
[31] Widom J., Special issue on materialized views and data warehousing. IEEE Bulletin on Data Engineering, 18(2), 1995.
[32] Zhang D., Gunopulos D., Tsotras V. J., Seeger B., "Temporal and Spatio-Temporal Aggregations over Data Streams Using Multiple Time Granularities", Information Systems, vol. 28, no. 1-2, pages 61-84, 2003.
[33] Zhang D., Gunopulos D., Tsotras V. J., Seeger B., Temporal Aggregation over Data Streams Using Multiple Granularities, Proceedings of the 8th International Conference on Extending Database Technology (EDBT): Advances in Database Technology, pages 646-663, March 25-27, 2002.
