
Mining Data Streams under Dynamically Changing Resource Constraints

Conny Franke, Marcel Karnstedt, Kai-Uwe Sattler

Department of Computer Science, University of California at Davis, USA
Department of Computer Science and Automation, TU Ilmenau, Germany

Abstract

Due to the inherent characteristics of data streams, appropriate mining techniques rely heavily on window-based processing and/or (approximating) data summaries. Because resources such as memory and CPU time for maintaining such summaries are usually limited, the quality of the mining results is affected in different ways. Using frequent itemset mining and an associated change detection as selected mining techniques, we discuss in this paper extensions of stream mining algorithms that allow the output quality to be determined for changes in the available resources (mainly memory space). Furthermore, we give directions on how to estimate resource consumption based on user-specified quality requirements.

1 Introduction

Stream mining has recently attracted attention from the database as well as the data mining community. The goal of stream mining is a fast and adaptive analysis of data streams, i.e., the discovery of patterns and rules in the data. An important task of traditional mining as well as stream mining is frequent itemset mining, which aims to identify combinations of items that occur frequently in a given sequence of transactions. Typical applications of mining data streams include click stream analysis, analysis of records in networking and telephone services, and analysis of sensor data, among others.

Another popular problem, especially for continuous data streams, is the detection of changes in the data. This includes aspects such as changes in the distribution of the data (possibly expressed in statistical terms like the median or quantiles) and burst detection, but also task-specific change detection, e.g., recognizing changes in the set of frequent itemsets, in the frequency of particular itemsets, or changing correlations between itemsets. Concerning the problem of change detection, the challenge of in-time processing and signaling gains additional importance besides resource restrictions.

The main challenge in applying mining techniques to data streams is that a stream is theoretically infinite and therefore in most cases cannot be materialized. This means that the data has to be processed in a single pass using little memory. Based on this restriction and the goals of data mining, one can identify two divergent objectives: On the one hand, the analysis should produce comprehensive and exact results and detect changes in the data as soon as possible. On the other hand, the single-pass demand and the problem of resource limitations allow the analysis to be performed only on an approximation of the stream (e.g., samples or sketches) or a window (i.e., a finite subset of the stream).

However, using an approximation or a subset of the stream affects the quality of the analysis result: the mining model differs from the mining model we would get if the "whole" stream or a larger subset were considered. This is particularly important because some of the proposed stream mining approaches support time sensitivity (reducing the influence of outdated stream elements) by using weaker approximations for outdated elements. Thus, the mining quality for these elements is worse than for newer data. The problem of the quality of the mining model can also be considered in the opposite direction: based on user-specified quality requirements, one could derive resource requirements, i.e., the memory needed for managing stream approximations in order to guarantee the requested quality.

[Figure 1 (schematic): mining operators and DSMS resources, coupled via quality constraints, resource changes, changes of quality, and resource allocation.]

Figure 1: Mapping of quality

We propose resource awareness in conjunction with quality awareness as one of the main requirements in stream mining – and challenges in parallel. Fig. 1 illustrates how the dependencies between both are integrated into the operational flow of stream processing. Two ways of putting resource and quality awareness into practice become evident:

1. Specify quality requirements and deduce the resources needed to achieve this quality.

2. Limit the resources provided for processing and deduce the achievable quality.

Based on this observation, we propose a resource-adaptive and quality-aware mining approach for frequent itemset mining. We argue that quality awareness is basically orthogonal to the specific mining problem, even if the individual mining approach requires dedicated techniques for considering mining quality. For this purpose, we discuss the general applicability of our approach and outline specific characteristics that potential mining techniques have to meet.

The remainder of this paper is structured as follows. After a brief survey of related work in Section 2, we introduce relevant quality measures in Section 3. In Section 4 we present our extended frequent itemset mining approach, adding quality measurement as well as a resource adaptation based on user-specified quality requirements. Based on this, we discuss approaches for resource-adaptive change detection in Section 4.3. After discussing the applicability of our approach to general mining problems in Section 4.4, we report results of an experimental evaluation in Section 5. Finally, we conclude the paper and discuss open issues for future work in Section 6.

2 Related Work

In this paper, we discuss strategies for adaptively mining data streams in general. We develop our considerations on the basis of a corresponding approach for frequent itemset mining proposed in [Franke et al., 2005].

Frequent itemset mining deals with the problem of identifying sets of items that frequently occur together in so-called transactions. Any itemset occurring in more than σ (the support) percent of all input transactions is regarded as frequent. Usually, a deviation of ε in the support of an itemset is accepted. σ and ε are (optionally dynamically) user-defined parameters. Basically, two classes of algorithms can be distinguished: approaches with candidate generation (e.g., the famous Apriori algorithm) and approaches without candidate generation. Only the latter are suitable for stream mining. Usually, these approaches are based on a prefix-tree-like structure. In this tree – the frequent pattern (FP) tree – each path represents an itemset; multiple transactions are merged into one path, or at least share a common prefix, if they share an identical frequent itemset [Han et al., 2000]. For this purpose, the items of each transaction are sorted in frequency-descending order and inserted into the tree accordingly. The FP tree is then used as a compact data summary structure for the actual (offline) frequent pattern mining (the FP-growth algorithm).
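The FP-tree construction described above is compact enough to sketch in code. The following Python sketch (names like FPNode and build_fp_tree are ours, not from [Han et al., 2000]) performs a first pass to count item frequencies, then inserts each transaction with its frequent items sorted in frequency-descending order, so that transactions sharing a frequent prefix share a path:

```python
from collections import defaultdict

class FPNode:
    """One prefix-tree node: an item plus a count shared by all transactions on this path."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # 1st pass: count item frequencies; items below min_count are dropped.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # 2nd pass: insert each transaction, frequent items sorted by descending
    # frequency (ties broken by name), so shared prefixes share a path.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if freq[i] >= min_count),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root, freq
```

FP-growth itself then mines this tree recursively; the sketch covers only the construction step that the streaming extensions below build upon.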

In order to mine streaming data in a time-sensitive way, an extension of this approach was proposed [Giannella et al., 2003a]. Here, so-called tilted time window tables are added to the nodes, representing window-based counts of the itemsets. The tilted windows allow summaries of the frequency information of recent transactions to be maintained at a finer granularity than for older transactions. The extended FP tree, called a pattern tree, is updated in a batch-like manner: incoming transactions are accumulated until enough transactions of the stream have arrived. Then, the transactions of the batch are inserted into the tree. For mining the tree, a modification of the FP-growth algorithm is used that takes the tilted time window table into account. The original approach assumes that there is enough memory available to deliver results in any required quality (expressed in terms of σ and ε, see Section 3), and no way is described how to proceed if the algorithm runs out of memory. In Section 4.2 we will discuss how the amount of required memory can be adjusted and how this affects the result quality.
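The tilted time window idea can be illustrated with a small sketch. The following Python class is a simplified logarithmic tilted-time window of our own design – it omits FP-Stream's intermediate buffer slots – in which level k holds counts that each cover 2^k batches, so recent batches stay at fine granularity while older ones are merged away:

```python
class TiltedTimeWindow:
    """Simplified logarithmic tilted-time window: level k stores counts that
    each cover 2**k batches; when a level exceeds two entries, the two oldest
    are merged and promoted, so the table keeps O(log N) entries overall."""
    def __init__(self):
        self.levels = [[]]  # levels[k]: counts covering 2**k batches, newest first

    def add(self, batch_count):
        self.levels[0].insert(0, batch_count)
        k = 0
        while len(self.levels[k]) > 2:
            # merge the two oldest entries of this level and promote the sum
            merged = self.levels[k].pop() + self.levels[k].pop()
            if k + 1 == len(self.levels):
                self.levels.append([])
            self.levels[k + 1].insert(0, merged)
            k += 1

    def size(self):
        return sum(len(level) for level in self.levels)
```

After N batches the table holds O(log N) entries – the property that the 2⌈log2(N)⌉ + 2 bound for the real structure formalizes – while the total count across all entries is preserved.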

Recently, the idea of processing streams while adapting to resource and quality constraints has gained increasing attention. There are several recent works dealing with quality-aware online query processing applications in centralized and distributed systems, e.g., [Berti-Equille, 2006]. However, only a few discuss concrete approaches and algorithms in stream mining scenarios, least of all the conjunction of resource adaptiveness and quality awareness.

In [Gaber et al., 2003] the authors propose a resource-adaptive mining approach based on goals similar to ours. They suggest adapting to memory requirements, time constraints, and the data stream rate by solely adapting the produced output granularity. The paper focuses on clustering, but the authors state the applicability to classification and frequent item mining (but not itemset mining). In succeeding works, e.g., [Gaber and Yu, 2006], they extend their approach by adding resource adaptiveness to the input rate and the actual data mining algorithm as well. However, the approach does not provide quality guarantees for the mining results.

Q     Output                                   Factors
QMa   ε/σ                                      queried time interval
QMi   σ                                        -
QTr   maximal "look back time"                 -
QTg   minimal granularity                      queried "look back time"
QTc   minimal time till detection of changes   update interval u

Table 1: Quality measures and influencing factors

The algorithm RAM-DS (Resource-Aware Mining for Data Streams) proposed in [Teng et al., 2004] uses a wavelet-based approach to control the resource requirements. It is mainly concerned with mining temporal patterns, and the method can only be used in combination with a certain regression-based stream mining algorithm proposed by the same authors. Although the proposed algorithm for mining temporal patterns is resource-aware, it is not resource-adaptive and does not provide guarantees for the quality of the mining results.

3 Quality Measures for Stream Mining

As stated in Section 1, any resource adaptiveness comes along with effects on the achievable result quality. In [Franke et al., 2005] we introduced the different quality measures we examine in the context of stream mining. As a result, we distinguish several classes of quality measures, which are summarized together with exemplarily chosen representatives in Table 1. For a more detailed discussion we refer to [Franke et al., 2005].

All QT∗ are identical across different mining problems and denote concrete measures, while the QM∗ represent classes of quality measures that are always specific to the investigated problem and the applied algorithm(s). A special measure for the problem of frequent itemset mining is the ratio between a value ε and the support σ, which reflects the maximal deviation from the defined support that each of the itemsets finally identified as frequent may possess. In the context of frequent itemsets, the support is a so-called interestingness measure QMi ([Tan and Kumar, 2000]).

From our point of view, any mining technique applied to continuous and evolving data streams should take time sensitivity into account; thus, we define time as another important quality measure. QTr describes how far we can look back into the history of the processed stream, and QTg how exactly we can do this, i.e., which time granularity we can provide. QTc corresponds to one of the main challenges of stream mining: the actual time we need in order to register changes in the stream. These temporal quality measures must not be confused with temporal aspects that influence the methodical quality (see Table 1).

For the remainder of this paper, when we refer to all quality measures as a whole, we use the symbol of the superclass Q and the general term 'quality'.

This work aims at determining two (theoretical) kinds of functions:

1. r : Q → R – maps a claimed quality to the resources needed, and

2. r̄ : R → Q – maps provided resources to the best achievable quality, as an inverse function to r.


In more detail, r is one function r(argsx, Qx(argsx), argsy, Qy(argsy), . . .) taking all claimed quality measures Qx, Qy, . . . and their factors as input, but we write r(Q) for short. In contrast, r̄ represents a bundle of inverse functions r̄x, r̄y, . . ., each corresponding to one quality measure Qx, Qy, . . .. Moreover, as we do not state the distribution of the stream elements as an input factor for either function, r and r̄ differ with different stream characteristics.

4 Resource-aware Mining Operators

4.1 Operators for Data Streams

The techniques proposed in this work are implemented on top of a Data Stream Management System (DSMS) called PIPES [Kramer and Seeger, 2004]. Rather than a monolithic DSMS, PIPES is an infrastructure that, in conjunction with the accompanying Java library XXL [d. Bercken et al., 2001], allows for building a DSMS specific to concrete applications with full functionality. Usually, a DSMS manages multiple continuous queries specified by operator graphs, which allows for reusing shared subqueries. PIPES adapts this concept and introduces three types of graph nodes: sources, sinks, and operators, where operators combine the functionality of a source and a sink with query processing functionality. The resulting query graphs can be built and manipulated dynamically using an inherent publish-and-subscribe mechanism. This offers, among others, the possibility to adaptively optimize the processing with respect to resource awareness. PIPES provides the operational run-time environment to process and optimize queries as well as a programming interface to implement new operators, sources, and sinks.

The aimed-for resource awareness mainly arises from implementing the mining techniques as separate PIPES operators. But why do we implement them as operators rather than, for instance, building specialized DSMSs for clustering and frequent itemset mining? The answer is that we want to be able to freely combine these mining techniques with each other, with other mining algorithms, and, generally, with all operators implemented in PIPES. Fig. 2 shows a small example to illustrate this approach (each operator is depicted by its corresponding algebra symbol). The stream data produced by the two sources O1 and O2 is processed in three ways: sink S1 receives all clusters determined (ζ) for O1 after a preprocessing filter step. S2 and S3 receive association rules determined (Φ) on the joined data from O1 and O2 – this implies finding the frequent itemsets (φ). S3 works on the determined rules directly, while S2 receives them after applying another clustering step (ζ) in order to identify interesting rules by grouping related ones (similar to the approach in [Toivonen et al., 1995]).

The frequent itemset mining operator takes three dynamically adjustable parameters:

1. The size b of a batch.
2. The queried time interval [ts, te].
3. An output interval o.

b represents the finest granularity of observed time and is equal to the internal update interval u. Thus, o should be a multiple of b. The resulting output is a data stream consisting of one stream element per passed output interval, containing the frequent itemsets found in [ts, te]. In correspondence with the aimed-for resource awareness, a user can choose between two possibilities to initialize the operator, by providing:

1. Claimed qualities.
2. A memory limit.

In the first case, the amount of memory is calculated by adapting parameters inside the operator to achieve the claimed quality. In the second case, the adaptation technique tries to find the ideal parameter settings to achieve optimal quality results, if possible, while adhering to the given memory limit. In order to achieve this, the user must define which of the supported qualities is prioritized and/or weight them accordingly.

4.2 Frequent Itemset Mining

We note two main requirements that frequent itemset mining techniques must satisfy in order to lend themselves to our needs: they have to provide error guarantees (frequent itemset mining on data streams usually produces approximate results, e.g., there may be some false positives in the resulting output), and the approach has to be time-sensitive. The FP-Stream approach in [Giannella et al., 2003a] is capable of satisfying these requirements. Asked for the frequent itemsets of a time period [ts, te], FP-Stream guarantees that it delivers all frequent itemsets in [ts′, te′] with frequency ≥ σ · W, where W is the number of transactions the time period [ts′, te′] contains. ts′ and te′ are the time stamps of the used tilted time window table (TTWT) that correspond to ts and te, depending on QTg. The result may also contain some itemsets whose frequency is between (σ − ε) · W and σ · W.

Our first goal was to find out how much memory the algorithm needs in order to deliver results of a certain quality. We conducted an extensive series of tests on the algorithm's memory requirements in different parameter settings. Secondly, we extended the approach from [Giannella et al., 2003a] so that it can cope with limited memory, resulting in the algorithm's resource and quality awareness.

Estimating the amount of memory

First, we estimate the amount of memory a pattern tree needs within a given parameter setting. For this, we need to know the number of nodes in a pattern tree and the amount of memory each individual node needs. In [Giannella et al., 2003a] an upper bound of 2⌈log2(N)⌉ + 2 is given for the size of a TTWT, where N is the number of batches seen so far. This is because the algorithm uses a logarithmic TTWT to the basis 2 and is designed with one buffer value between each two values except the two most recent. In our experiments, the actual number of entries, averaged over all TTWTs in a tree, was always less than 70% of this upper bound. Besides the size of the TTWT, each node in a pattern tree needs some fixed-size memory c for storing the itemset it represents and information about its parent node and child nodes.

The number of nodes in a pattern tree depends on several factors. On the one hand, there are the values of the algorithm's input parameters ε and b. The value of σ does not affect the number of nodes in a pattern tree, because the adding and dropping conditions in a pattern tree depend only on ε – even though σ is a quality measure! On the other hand, there are the characteristics of the underlying data stream, like the average number of items per transaction and the number of distinct items in the stream. We have not yet found a concrete formula describing the maximum number of nodes in a pattern tree for specific parameter settings. But we conducted a series of tests showing that the maximum number of nodes remains constant over time while


[Figure 2 (schematic): a query plan in which sources O1 and O2 feed sinks S1, S2, and S3 through the operators σ (selection), π (projection), ⋈ (join), φ (frequent itemsets), Φ (association rules), and ζ (clustering).]

Figure 2: Example query plan

the algorithm processes an infinite series of transactions. Thus, for certain values of ε and b and specific characteristics of the underlying data stream, we know the maximum number of nodes in a pattern tree. In general, we can state that a large pattern tree leads to mining results of better quality than a small pattern tree.

The fact that the overall space requirements stabilize or grow very slowly as the stream progresses was already shown in [Giannella et al., 2003a]. The authors investigated different values of σ and of the average number of items per transaction. [Giannella et al., 2003b] additionally showed the same effect for varying values of ε. We extended these tests to different values of the number of distinct items in the stream and of the size of b. With a constant input rate, the size of b determines the number of transactions a batch contains. As we will show in Sect. 5, the value of b is the only one that has very little impact, as long as the number of transactions in a time interval of size b exceeds a certain threshold (depending on ε and some data stream characteristics). This threshold fades out the effect of temporarily frequent itemsets that have no significance for the overall mining result.

According to Sect. 3 we can determine r as: r = numberOfNodes(ε, b) · (2⌈log2(N)⌉ + 2 + c), representing a heuristic approximation of the resources actually needed.
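Assuming fixed byte costs per tree node and per TTWT entry (the concrete sizes passed in below are illustrative assumptions, not values measured in the paper), this heuristic can be written as:

```python
import math

def pattern_tree_memory(num_nodes, batches_seen, node_bytes, ttwt_entry_bytes):
    """Heuristic sketch of r: every node carries a TTWT at its upper bound of
    2 * ceil(log2(N)) + 2 entries, plus fixed per-node overhead c (node_bytes).
    All byte sizes are illustrative, not measured values."""
    ttwt_entries = 2 * math.ceil(math.log2(batches_seen)) + 2
    return num_nodes * (ttwt_entries * ttwt_entry_bytes + node_bytes)
```

For example, a tree of 1000 nodes after 1024 batches, with 8 bytes per TTWT entry and 64 bytes of fixed node overhead, yields 1000 · (22 · 8 + 64) = 240000 bytes.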

Dynamic tree size adjustment

In the following, we introduce an extension of the approach of [Giannella et al., 2003a] that can cope with limited memory and uses the available memory as effectively as possible. If there is not enough memory available for finding frequent itemsets in the claimed quality, we need to dynamically adjust the size of the pattern tree according to the actual memory conditions. At first we implemented the approach in [Giannella et al., 2003a] and examined its memory requirements for different parameter settings. After that, we considered a couple of possibilities to control the memory requirements of the pattern tree. Thus, we obtained an extended approach that offers several alternatives for controlling the tree size, depending on the user's quality weighting. We introduce a memory filling factor f defined as f = (actual memory usage) / (provided memory). Depending on f, our approach takes action to reduce or increase the size of the pattern tree. The possibilities for manipulating the size of the tree are:

1. Adjust the value of ε while keeping σ constant, i.e.,change the approximation quality QMa of our miningresults according to available memory.

2. Adjust the value of σ according to available mem-ory while keeping ε/σ constant, i.e., keep the qual-ity QMa of the mining result constant but change thequality of interestingness QMi.

3. Limit the size of each TTWT according to availablememory, i.e., change the time range quality QTr.

4. Adjust the value of b according to available memory,i.e., change the time granularity quality QTg .

Since we can estimate the memory requirements of a pattern tree for a given set of parameters, we can also estimate the maximum number of nodes our pattern tree must not exceed in order to adhere to a certain memory limit. In each step we assume the size of a TTWT to be at its upper bound. The changes in this upper bound are estimated only at constant intervals (in our tests, every 100 batches), because we want to avoid registering negligible changes of the maximum number of nodes after every batch.

Option 1: Adapting ε. A first option for manipulating the size of the pattern tree is to change the value of ε, i.e., change QMa. Therefore, rather than taking the value of ε as an input parameter, we estimate an ideal ε that results in a pattern tree with approximately as many nodes as possible. However, the user may specify a lower bound for the value of ε, i.e., an upper bound for the quality of the mining result.

Since we are not yet able to calculate the maximum number of nodes for a given amount of memory exactly, we also cannot estimate an ideal value of ε. Our algorithm adjusts the value of ε depending on the filling factor f after every processed batch as follows:

• f < 0.85: Decrease ε by ten percent. Use this ε when processing the following batches.
• 0.85 ≤ f ≤ 1.0: The value of ε remains fixed.
• f > 1.0: Increase ε by ten percent. Conduct tail pruning at the TTWTs of each node in the pattern tree and drop all nodes with empty TTWTs. Repeat these steps as long as f > 1.0.

Note that f can become greater than 1. This is because we assume that more memory is available in the system than the amount we provide to the pattern tree. f > 1 indicates that the pattern tree consumes more memory than we granted to it, and the tree will thus reduce its size.
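The control rule above can be sketched as follows; prune_tree is a hypothetical callback standing in for the tail-pruning step (it returns the new memory usage), and the 0.85/1.0 thresholds are the ones used in our tests:

```python
def adjust_epsilon(eps, used_bytes, provided_bytes, prune_tree=None):
    """Sketch of the Option-1 rule: adapt epsilon from the filling factor
    f = used / provided. prune_tree is a hypothetical callback that performs
    tail pruning for the given epsilon and returns the new memory usage."""
    f = used_bytes / provided_bytes
    if f < 0.85:                 # memory to spare: tighten eps for better QMa
        return eps * 0.9
    if f <= 1.0:                 # inside the target band: keep eps
        return eps
    while f > 1.0:               # over budget: relax eps and prune until it fits
        eps *= 1.1
        if prune_tree is None:
            break
        used_bytes = prune_tree(eps)
        f = used_bytes / provided_bytes
    return eps
```

For instance, with 80 of 100 provided units used (f = 0.8), ε drops by ten percent; at f = 0.9 it stays fixed; above f = 1.0 it grows until pruning brings the tree back under the limit.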

In this approach, we have to store the value of ε we used in each specific time interval, in addition to monitoring the number of transactions in each interval. If two TTWT frequency entries ni and nj are merged, we also have to merge the corresponding ε values εi and εj in order to determine the achievable quality for this time period. We average the two distinct values of ε as described by Equation 1, where wi and wj denote the sizes of the time intervals ti and tj.

ε = (εiwi + εjwj) / (wi + wj) (1)
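As a one-line sketch, Equation 1 is an interval-weighted average of the two ε values:

```python
def merge_epsilons(eps_i, w_i, eps_j, w_j):
    """Equation 1: when two TTWT frequency entries are merged, average their
    epsilons weighted by the sizes of the covered time intervals."""
    return (eps_i * w_i + eps_j * w_j) / (w_i + w_j)
```

Merging an interval of size 2 mined with ε = 0.01 and an interval of size 1 mined with ε = 0.04 yields a combined ε of 0.02.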

It was shown in [Giannella et al., 2003a] that, for a fixed ε, if all itemsets whose approximate frequency is larger than (σ − ε) · W are requested, then the result will contain all frequent itemsets in the period [ts′, te′]. Thus, in our extended approach, all itemsets with approximate frequency larger than (σ − ε) · W are returned. The value of this ε depends on the time intervals contained in [ts′, te′]. If [ts′, te′] covers only time intervals where the same value of ε was used for all batches, then ε is equal to this value. If [ts′, te′] contains time intervals where the value of ε differs, the value of ε becomes:

ε = ( Σ_{i=s′}^{e′} εi · wi ) / W    (2)

In summary, in our first option we control the amount of memory used by varying the value of ε and hence ε/σ, which reflects the approximation quality QMa of our mining result.

Option 2: Adapting σ. The second option for adjusting the size of the pattern tree is to alter the value of σ. As σ does not influence the size of the tree directly, ε/σ remains the same. That is why the user does not provide a fixed value of σ, but rather asks for a certain ε/σ that should be guaranteed. Again, the user may also specify a lower bound for the value of σ. The handling of σ is analogous to the handling of ε in the first option, with the only difference that ε must be adjusted in parallel in order to keep ε/σ constant. The analogy also applies to determining the value of σ:

σ = ( Σ_{i=s′}^{e′} σi · wi ) / W    (3)

The value of ε again results from Equation 2. All itemsets whose approximate frequency is larger than (σ − ε) · W are returned. We guarantee to deliver all itemsets whose actual frequency is ≥ σ. Because ε/σ is kept constant as requested by the user, the quality QMa meets the user's requirements. By adjusting the value of σ, we alter the quality QMi by varying the frequency with which an itemset must occur in order to belong to the delivered set of results.
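A minimal sketch of the coupled update, mirroring the Option-1 rule (function name and structure are illustrative, not the exact implementation): σ is adapted from the filling factor and ε follows as ratio · σ, so ε/σ stays fixed:

```python
def adjust_sigma(sigma, ratio, used_bytes, provided_bytes):
    """Sketch of the Option-2 rule: adapt sigma from the filling factor
    f = used / provided while epsilon follows as eps = ratio * sigma, so the
    quality QMa = eps/sigma stays fixed. Returns (sigma, eps)."""
    f = used_bytes / provided_bytes
    if f < 0.85:
        sigma *= 0.9   # memory to spare: also mine less frequent itemsets
    elif f > 1.0:
        sigma *= 1.1   # over budget: keep only the more interesting itemsets
    return sigma, ratio * sigma
```

Lowering σ also lowers ε (letting the tree grow), while raising σ raises ε and prunes the tree, exactly as described above.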

But what is the difference between the first and the second option? In the first option we keep σ constant and change ε. Thus, we can request all itemsets with a minimum support of (σ − ε) · W and accept a poorer quality QMa. In the second option, we modify σ but keep ε/σ constant. Thus, we also keep the user-defined quality QMa constant and return all itemsets with a minimum support of (σ − ε) · W. In this case, only the more interesting itemsets are found, in terms of σ as an interestingness measure from QMi. An effect on memory usage is achieved in both options. Moreover, in both options the methodology of pruning TTWTs remains unchanged.

Option 3: Limiting the size of the TTWTs. The third option is to limit the size of each TTWT to a fixed value. This restricts how far we can look back into the history of the processed stream, because we limit the number of time intervals for which we store frequency information. Thus, we are not able to deliver results for a time period that includes batches lying farther back in history than the information we recorded. According to Sect. 3, this leads to an impairment of the time range quality QTr.

As the number of time intervals stored in one entry of a TTWT increases logarithmically, saving a large amount of memory demands limiting the size of a TTWT drastically. In this way, the information of a considerable portion of the observed batches would get lost. Conversely, by lowering the maximal size only slightly, only a small amount of memory can be saved.

Option 4: Adapting b. The last option to limit the used memory is to adjust the value of b. Assuming a constant input rate, the size of b determines the number of transactions a batch contains. Increasing b leads to an impairment of the time granularity quality QTg, as we reduce how exactly we can look back. Our experiments reveal that the number of transactions per batch does not affect the number of nodes in the pattern tree significantly. By increasing the value of b, we can only reduce the total number of entries in a TTWT while processing a finite part of the stream. Considering infinite data streams, since the size of a TTWT grows logarithmically, the number of entries in a TTWT will converge to the same value for all choices of b. Therefore, this option only has a significant impact on the required memory if it is combined with a limitation of the TTWT size. Combining both, we can reduce the granularity of a time interval and look back farther into the history of the data stream using the same number of TTWT entries.
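The logarithmic convergence is easy to illustrate numerically with the TTWT upper bound from above (the stream lengths below are arbitrary example values): making batches 16 times coarser removes only a handful of TTWT entries.

```python
import math

def ttwt_entries(stream_len, b):
    """Upper bound on TTWT entries after a stream of stream_len time units
    processed with batch interval b, i.e. N = stream_len // b batches."""
    n_batches = max(1, stream_len // b)
    return 2 * math.ceil(math.log2(n_batches)) + 2

# Doubling b removes roughly one halving of the batch count, i.e. ~2 entries:
# for a 2**20-unit stream, b = 1 gives 42 entries, while b = 16 still gives 34.
```

This is why adapting b alone barely helps, but combined with a cap on the TTWT size it trades granularity for reach into the past.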

4.3 Change Detection

In Sect. 1, we briefly discussed the importance of quickly detecting changes in data streams. In the following, we discuss change detection on the basis of the introduced frequent itemset mining algorithm as an example. We outline how change detection may be implemented efficiently and, more importantly, resource-aware. In the next section we will investigate how our approach can be mapped to general stream mining tasks, including change detection as a post-processing step on the basis of concrete mining tasks as well as an independent mining problem. On the basis of the extended FP-Stream approach, several possibilities arise for implementing an efficient detection of changes in the frequent itemsets themselves, in their temporal occurrences, and/or in other relevant aspects. With several of these alternatives we even get a resource-adaptive approach for detecting changes for free. We will sketch five different approaches and briefly discuss their pros and cons.

From our point of view, any technique for change detection should meet the following criteria:

• it is based on data structures which can be limited in size, preferably in a dynamic manner
• it produces realtime results, which requires an online and incremental processing
• it is capable of detecting any kind of changes that may be of interest (i.e., frequency changes, correlation changes, temporal changes, and so on) and is not restricted to the detection of specific kinds
• changes can be represented to the user in an intuitive and understandable manner

For frequent itemset mining, many sophisticated changes in the stream (e.g., whether subsets of a formerly frequent itemset are still frequent) can be detected by analyzing basic changes (i.e., the occurrence/omission of single members of the set of all frequent itemsets). Therefore, for now we primarily focus on detecting these basic changes.

Approaches on Separate Data Structures

A naive but simple solution is to take the output of the frequent itemset mining operator as an independent input stream for a separate change detection operator. In this case, information about the itemsets found in the past must be stored in some separate data structure. We investigated three different approaches for this:

1. store all frequent itemsets produced so far togetherwith additional information (e.g., temporal) in a table


2. store snapshots of the pattern tree at regular intervals

3. store only the differences between the sets of all frequent itemsets at regular intervals (similar to the incremental backup technique for database systems)

A main problem of all three approaches is the need for a separate data structure. This data structure allocates extra resources, and thus, resource limits must be shared between the actual mining operator and the change detection operator. However, the resource-adaptive techniques introduced in this work could be applied to this data structure in order to meet given resource limits. The first two methods allow for detecting a wide variety of changes, even temporal ones, but consume far too much memory. With the first variant, the task of detecting changes in one specific itemset is complicated, because there is no information about the location of itemsets in the pattern tree once they have been output from the frequent itemset operator. The last of the three methods is suitable for quickly detecting new or omitted itemsets between two successive time steps, but it complicates the handling of arbitrary time intervals and the detection of changes in the frequency of itemsets that have already been frequent before.
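To make the trade-off of the third variant concrete, consider this hypothetical sketch of an incremental-backup-style diff log (class and method names are ours): only per-step differences are stored, so reconstructing the set of frequent itemsets for an arbitrary point in time requires replaying all diffs up to that point.

```python
class DiffLog:
    """Store only per-step differences between successive sets of
    frequent itemsets (like incremental backups in database systems)."""

    def __init__(self):
        self.diffs = []  # one (added, removed) pair per time step

    def record(self, old_state, new_state):
        """Append the diff between two successive itemset sets."""
        self.diffs.append((new_state - old_state, old_state - new_state))

    def state_at(self, step):
        """Reconstruct the full set at a given step by replaying all
        earlier diffs -- O(step) work, which makes queries over
        arbitrary time intervals expensive."""
        state = set()
        for added, removed in self.diffs[:step]:
            state |= added
            state -= removed
        return state

log = DiffLog()
s0 = set()
s1 = {frozenset({"a"})}
s2 = {frozenset({"a"}), frozenset({"a", "b"})}
log.record(s0, s1)
log.record(s1, s2)
# log.state_at(2) reconstructs s2 by replaying both diffs
```

New or omitted itemsets between two successive steps are available immediately from the last recorded pair, while any query over a longer interval pays the replay cost.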

Approaches on the Pattern Tree

In order to implement a resource-efficient change detection, a more intuitive approach is to try to detect changes directly from the pattern tree used in the frequent itemset mining operator. This needs no extra memory, which is one of the main resources in our considerations. Moreover, because change detection and itemset mining are combined into one operator, real-time signaling is easy to achieve. Again, we distinguish two specific approaches:

1. detect changes as soon as the pattern tree is modified

2. detect changes after executing the FP-growth algorithm on the pattern tree (when looking for the actual frequent itemsets after each batch)

The second approach is capable of detecting more changes, because with the first one we can only watch itemsets that are modified in the current batch run. If a change in an itemset is only recognized during the run of FP-growth, the first method will not detect this change. The disadvantage of the second method is that it can only be run after processing a whole batch and has to completely traverse the tree, which leads to a worse reaction time and less information about new or deleted nodes. However, with the first method we have to deal with change candidates, because not every modification will lead to an actual change in the itemsets. Advantages of both approaches, in contrast to those on separate data structures, are low runtime, easy detection of temporal changes (the TTWTs containing temporal information can be analyzed directly) and no extra memory consumption.

Usually, the reaction time QTc is the most important quality measure for change detection in data streams. The approaches working on the pattern tree, particularly the first one, can provide better values for QTc than those on separate data structures, because they do not depend on the output interval of the mining operator or a predefined interval. Rather, both techniques can be integrated right into the frequent itemset mining operator.

In this section, we briefly presented preliminary results we achieved when investigating approaches for resource-aware change detection. For now, we state that the choice of algorithm depends on the goals actually desired by the user. Under special circumstances several of the introduced methods should be combined; when aiming for a resource-aware, and moreover resource-efficient, method, those based on the pattern tree should be preferred. In future work we will investigate approaches for change detection in more detail, including the combination with resource-adaptive techniques, considerations about achievable qualities (mainly for QTc, but other measures like QTg and QMa have to be considered as well) and change detection methods for general mining tasks.

4.4 General Applicability of the Approach

We are currently generalizing our approach of resource-adaptive frequent itemset mining. Our work aims at making the technique applicable in general to data stream mining algorithms with certain characteristics.

To start with, there exist other algorithms for mining frequent itemsets in data streams that use the same approach of approximated frequencies as the FP-Stream algorithm, e.g., [Manku and Motwani, 2002; Chang and Lee, 2004]. We can thus extend these algorithms in a way analogous to the modification of FP-Stream. In general, every algorithm whose resource requirements depend on parameters like σ and ε can be modified in this way, with different impact on the degree of adaptiveness.
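For illustration, a simplified single-item version of the lossy counting scheme of [Manku and Motwani, 2002] (our own sketch, not their implementation): the synopsis size is bounded in terms of 1/ε, so its memory footprint is directly controlled by ε, which is exactly the knob such an adaptation technique turns.

```python
import math

def lossy_count(stream, epsilon):
    """Lossy counting for single items: counts live in buckets of
    width ceil(1/epsilon); at every bucket boundary, entries whose
    count plus maximal undercount falls below the bucket id are
    pruned. Estimated counts are off by at most epsilon * N."""
    width = math.ceil(1 / epsilon)
    counts, deltas = {}, {}
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1  # maximal possible undercount
        if n % width == 0:             # bucket boundary: prune
            for it in [i for i in counts
                       if counts[i] + deltas[i] <= bucket]:
                del counts[it], deltas[it]
            bucket += 1
    return counts

# The frequent item survives, the ten one-off items are pruned away
stream = ["x"] * 100 + list("abcdefghij")
# lossy_count(stream, 0.1) -> {"x": 100}
```

Raising ε makes the bucket id advance faster, so pruning becomes more aggressive and the synopsis shrinks; lowering ε does the opposite and regains accuracy.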

Some of the presented methods to manipulate the size of the pattern tree can be applied to these algorithms without major changes to either method or algorithm. Despite the usage of different data structures in these algorithms, it is possible to adapt the size of their synopsis by varying the values of σ and ε. The quality of the mining results remains a computable value that can be guaranteed as in the extended FP-Stream algorithm.

A majority of data stream mining algorithms uses synopsis data structures to capture the content of the data stream. Since these synopses only represent an approximation of the stream's actual content, there is always a notion of quality associated with them. Our method can thus in principle be applied to such stream mining algorithms as well. In [Franke et al., 2005] we already successfully applied the approach in order to implement a resource-aware clustering operator for PIPES.

5 Evaluation

The purpose of the following evaluation is twofold: first, we show how we achieve the aimed-for quality awareness. To this end, we evaluate how well the algorithms achieve a claimed quality and how exact the corresponding calculations are. In the context of the proposed stream mining approach, quality awareness comes along with the adaptation to provided resources and the determination of the resources needed for aimed-for quality measures. This is the second objective of this section: we show that memory requirements are approximated satisfactorily and are finally met, and which conclusions we can draw about the achieved quality.

Before evaluating our quality-aware frequent itemset mining approach we conducted a series of tests with the original FP-Stream algorithm. We examined how the algorithm behaves for different parameter settings of ε, σ and b. We used synthetic data generated by the IBM market-basket data generator. In the first set of experiments 1M transactions were generated using 1K distinct items and an average transaction length of 3. All other parameters were set to their default values from the generator.

Since in the original approach a batch does not cover a constant period of time but a constant number of transactions, we set the size of a batch to 5000 transactions.

[Figure 3: Average number of nodes in a pattern tree. (a) Varying support and epsilon: avg. number of nodes vs. ε (in percent of σ) for supports 0.05, 0.07 and 0.1. (b) Varying support and batch size: avg. number of nodes vs. number of transactions per batch for supports 0.025, 0.05 and 0.1. (c) Resource awareness: actual and maximum number of nodes in the pattern tree and the quality 1/(ε/σ) vs. number of batches.]

Figure 3(a) shows some of our experimental results with the original algorithm. We measured the average number of nodes in a pattern tree for three different values of σ. For each σ we ran the algorithm with various ratios between ε and σ to simulate different quality demands. As expected, the number of nodes decreases with rising ε and σ.

Figure 3(b) shows results of another series of tests we conducted with the original algorithm. We processed the test data with different values of σ (always setting ε = 0.1 · σ) and with various batch sizes. The number of nodes in the pattern tree does not decrease significantly even when the batch size is raised considerably.

We also processed test data with more items per transaction on average (5, 10) and/or more distinct items (5K, 10K), and we additionally tried different batch sizes (1000 to 30000). The main conclusion is always as stated above. One remarkable thing we noticed is that for small batch sizes the number of nodes in a pattern tree is far from constant. For example, when processing test data with 1M transactions, 1K distinct items and an average transaction length of 3, we set σ = 0.025, ε = 0.1 · σ and the batch size to 1000 transactions. The number of nodes in the resulting pattern tree oscillated between under 1000 and nearly 5000. When processing the same data set with higher values of σ, ε or batch size, this range got significantly smaller. For a support of 0.07 the difference between the minimum and the maximum number of nodes was 45. Remainders of this effect can be seen in Figure 3(b) for σ = 0.025 in the sharp decrease of the number of nodes for low batch sizes.

For evaluating the quality-aware frequent itemset mining approach, we again generated data sets with 1M transactions using 1K distinct items and an average transaction length of 3. Our algorithm consumed the stream of transactions from a source with a constant output rate of 3 seconds. The value b of the finest granularity of time was set to 15000 seconds, so each processed batch contained 5000 transactions. The value of σ was set to 0.05.

First, we wanted to demonstrate that our approach can cope with changing memory conditions. Instead of limiting the actual amount of available memory we limited the number of nodes that the pattern tree may have. One could calculate the actual amount of memory needed, since the maximum size of a TTWT can be estimated as described in Section 4.2.

Initially we limited the maximum number of nodes the generated pattern tree may have to 340. As we do not have a formula yielding the maximum number of nodes in a pattern tree for a given set of parameters, we had to choose an adequate value of ε for the algorithm. Our previous experiments showed that ε = 0.1 · σ = 0.005 would be a good value to start with.

We started the algorithm and decreased the maximum number of nodes after 40 batches to 200 nodes. We then raised it again after another 40 batches to a value of 400 nodes. Figure 3(c) shows the algorithm's behavior. After 40 batches, it had to raise the value of ε several times to get its filling factor below 1. Then, after the next 40 batches, the algorithm lowered the value of ε after every processed batch until a tree size was reached that was close enough to the maximum according to the filling factor. Figure 3(c) also displays the quality of the stream mining for every batch, i.e., the value of 1/(ε/σ).
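The adjustment behavior visible in Figure 3(c) can be paraphrased as a simple per-batch control loop (an illustrative sketch of our own; the step and slack constants are assumptions, not the values used in the implementation):

```python
def adjust_epsilon(eps, sigma, n_nodes, max_nodes,
                   step=1.25, slack=0.9):
    """One adjustment after each processed batch, driven by the
    filling factor n_nodes / max_nodes of the pattern tree."""
    filling = n_nodes / max_nodes
    if filling > 1.0:
        # Over the node limit: coarsen the summary (raise eps),
        # but never beyond sigma.
        return min(eps * step, sigma)
    if filling < slack:
        # Plenty of head room: refine the summary (lower eps)
        # to regain quality 1 / (eps / sigma).
        return eps / step
    return eps  # close enough to the limit: keep eps

# Shrinking the limit forces eps up; a generous limit lets it drop
tight = adjust_epsilon(0.005, 0.05, 450, 400)   # > 0.005
loose = adjust_epsilon(0.005, 0.05, 200, 400)   # < 0.005
steady = adjust_epsilon(0.005, 0.05, 380, 400)  # unchanged
```

Repeated calls reproduce the staircase pattern of Figure 3(c): several raises right after the limit drops, then gradual refinement once the limit is relaxed.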

[Figure 4: Quality awareness. Histogram of the number of itemsets delivered, bucketed by their actual frequency, with the thresholds (σ − averaged ε)·W and σ·W marked.]

With our second series of tests of our quality-aware approach we demonstrate the quality of our mining results. We started the algorithm using option 1 of our proposed techniques for adjusting the size of the pattern tree, i.e., we change the value of ε when necessary. The experimental settings are equal to those of the above test, only this time we set σ = 0.025 and ε = 0.2 · σ in order to obtain any frequent itemsets at all. The experiments show that we estimated the overall quality of our mining results, ranging over several batches (and thus several values of ε), correctly, and that it is indeed suitable to average distinct values of ε as described.

First we ran our approach once to obtain the value of ε, which was approximately 0.00765. Then we used the original FP-growth algorithm to get the actual set of frequent itemsets from our test data. We also obtained the actual frequencies of all itemsets having frequencies between a little less than (σ − ε) and the required support. Then, we ran our quality-aware approach and asked for the set of frequent itemsets after the 120th batch. We set the time period [ts, te] to the whole period of time the stream had been processed so far, so the window we requested contained W = 600000 transactions. After we received the result, we compared the output to what we knew from the FP-growth method. For each itemset our approach delivered, we looked up its real frequency and plotted this information in the histogram shown in Figure 4. All itemsets our approach output were inserted into frequency buckets according to their real frequency, which we obtained using the FP-growth algorithm.

Figure 4 shows that the output contained no itemsets with a frequency of less than (σ − ε) · W. The histogram also shows that exactly 23 itemsets with a frequency of at least σ · W were delivered. These itemsets are the same ones FP-growth delivered.
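The guarantee visible in the histogram follows from the standard ε-deficient answer semantics, which can be stated compactly (a sketch; the function name is ours): every delivered itemset has an approximate count of at least (σ − ε)·W, and since approximate counts underestimate true counts by at most ε·W, no itemset with true frequency of at least σ·W can be missed.

```python
def answer_query(approx_counts, sigma, eps, W):
    """Return all itemsets whose approximate count reaches the
    (sigma - eps) * W threshold. Delivered itemsets then have a
    true frequency of at least (sigma - eps) * W, and every
    itemset with true frequency >= sigma * W is included."""
    threshold = (sigma - eps) * W
    return {s for s, c in approx_counts.items() if c >= threshold}

# sigma = 0.05, eps = 0.01, W = 1000 -> threshold 40
approx = {frozenset({"a"}): 60, frozenset({"b"}): 39}
result = answer_query(approx, 0.05, 0.01, 1000)
# only {a} passes the threshold
```

This is exactly why the leftmost histogram bucket in Figure 4 is empty: nothing below (σ − ε)·W can pass the answer threshold.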

All in all, the experiments met our expectations. In [Franke et al., 2005] we also conducted similar experiments for a resource-aware clustering strategy, which corroborate the results shown here. The approximations of memory usage hold, and the quality deduced from the available resources is close to the quality actually achieved. This is true for all examined quality measures, as far as our tests can show. Of course, we still have to run a number of extended test series, including different parameter settings, varying stream characteristics and a deeper analysis of several (classes of) quality measures. With the results of these subsequent tests we could finally demonstrate the quality and resource awareness already achieved in this work.

6 Conclusion

In this paper, we argued that the quality of analysis results is an important issue when mining data streams. The reason is that stream mining can usually be performed only on a resource-limited subset or approximation of the entire stream, which affects different measures of data quality. Based on specific quality measures for stream mining, we investigated and enhanced a frequent itemset mining technique in order to estimate the quality depending on the current resource situation (mainly the available memory) as well as to allocate the resources needed for guaranteeing user-specified quality requirements. Furthermore, we gave directions for making a whole class of stream mining algorithms resource- and quality-aware, including complementary tasks such as application-specific change detection.

Besides these goals and the mentioned considerations about resource-aware approaches for change detection in data streams, we will also apply the introduced techniques to other mining primitives. Based on the experience gained about the method's practical applicability and the reflections on different quality measures, we plan to build a formal framework for resource-adaptive stream mining under quality guarantees. This includes the definition of the concrete quality/resource functions r and r.

From our point of view, resource requirements other than the memory consumption of stream synopses have to be regarded as well. For instance, besides the memory required by the pattern tree itself, the algorithm needs additional memory available at runtime. This memory is used for the actual computations and for storing two batches: the one that is currently processed as well as the one that is currently being built from the newly arriving transactions. When considering the computations done by the FP-Stream algorithm, we note that the FP-growth method used to determine the batch's frequent itemsets is very expensive in terms of memory requirements. Using a more memory-efficient FP-growth method as proposed and implemented by Ozkural and Aykanat [Ozkural and Aykanat, 2004] would decrease the overall memory requirements of the algorithm.

Moreover, in addition to memory awareness we will take the algorithms' runtime into account. In this context several additional aspects have to be considered, like the streaming rate of the incoming data and the runtime overhead imposed by our extension. In addition, we will have to deal with the conflict that adding resource adaptiveness to an algorithm imposes a runtime and memory overhead itself.

Finally, we have addressed only operator-local parameters like the amount of memory available for one specific operator. In future work, we plan to also take global properties of the whole analysis pipeline into account, e.g., load shedding and windowing operators, which have an impact on the output quality, too.

References

[Berti-Equille, 2006] L. Berti-Equille. Contributions to Quality-Aware Online Query Processing. IEEE Data Eng. Bull., 29(2):32–42, 2006.

[Chang and Lee, 2004] J. H. Chang and W. S. Lee. A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams. Journal of Information Science and Engineering, 20(4):753–762, 2004.

[d. Bercken et al., 2001] J. V. d. Bercken, B. Blohsfeld, J.-P. Dittrich, J. Kramer, T. Schafer, M. Schneider, and B. Seeger. XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries. In VLDB 2001, pages 39–48, 2001.

[Franke et al., 2005] C. Franke, M. Hartung, M. Karnstedt, and K. Sattler. Quality-Aware Mining of Data Streams. In IQ, 2005.

[Gaber and Yu, 2006] M. M. Gaber and Ph. S. Yu. A framework for resource-aware knowledge discovery in data streams: a holistic approach with its application to clustering. In SAC'06, pages 649–656, 2006.

[Gaber et al., 2003] M. M. Gaber, Sh. Krishnaswamy, and A. Zaslavsky. Adaptive mining techniques for data streams using algorithm output granularity. In The Australasian Data Mining Workshop, 2003.

[Giannella et al., 2003a] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In Workshop on Next Generation Data Mining, 2003.

[Giannella et al., 2003b] C. Giannella, J. Han, E. Robertson, and C. Liu. Mining Frequent Itemsets over Arbitrary Time Intervals in Data Streams. Technical report, Indiana University, 2003.

[Han et al., 2000] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In SIGMOD 2000, Dallas, USA, pages 1–12, 2000.

[Kramer and Seeger, 2004] J. Kramer and B. Seeger. PIPES - A Public Infrastructure for Processing and Exploring Streams. In SIGMOD 2004, pages 925–926, 2004.

[Manku and Motwani, 2002] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB 2002, Hong Kong, China, pages 346–357, 2002.

[Ozkural and Aykanat, 2004] E. Ozkural and C. Aykanat. A Space Optimization for FP-Growth. In ICDM Workshop on Frequent Itemset Mining Implementations, 2004.

[Tan and Kumar, 2000] P. Tan and V. Kumar. Interestingness Measures for Association Patterns: A Perspective. Technical report, University of Minnesota, 2000.

[Teng et al., 2004] W.-G. Teng, M.-S. Chen, and Ph. S. Yu. Resource-aware mining with variable granularities in data streams. In SDM 2004, 2004.

[Toivonen et al., 1995] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Haton, and H. Mannila. Pruning and grouping discovered association rules. In ECML Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, pages 47–52, 1995.