fine-granularitysignature caching in object database systems · stored in an object database system...

23
Fine-Granularity Signature Caching in Object Database Systems Kjetil Nørv˚ ag Department of Computer and Information Science Norwegian University of Science and Technology, 7491 Trondheim, Norway Abstract In this paper, we present the SigCache approach. In contrast to traditional signature files where signatures are stored in separate files, signatures are in our approach stored together with their objects. In addition, the most frequently accessed signatures are stored in a main memory signature cache (SigCache). When using the signatures stored in the SigCache as a filter during perfect match queries, the number of objects that actually have to be retrieved during perfect match queries can be reduced. The increase in update cost is insignificant, because the signatures are much smaller than the objects. Key words: object database systems, signature files, buffer management, query processing 1 Introduction In many of the emerging application areas for database systems, data is viewed as a collection of objects and the access pattern is navigational. A typical example of such an application area is XML/Web storage. XML data should preferably be stored in an object database system (ODB) [12], but it can also be stored in a re- lational database system or an object-relational database system. Another typical characteristic of the new applications is that a large fraction of the accesses are perfect match accesses on one or more words in text strings in the objects/tuples in these databases. For such accesses, signature files can be used to reduce the query cost. E-mail: [email protected] A signature is a bit string, which is generated by applying some hash function on some or all of the attributes of an object. Note that signatures are also often used in other contexts, e.g., function signatures and implementation signatures. Preprint submitted to Data and Knowledge Engineering 14 September 2001

Upload: others

Post on 18-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

Fine-Granularity Signature Caching in ObjectDatabase Systems

Kjetil Nørvag�

Departmentof ComputerandInformationScienceNorwegian University of ScienceandTechnology, 7491Trondheim,Norway

Abstract

In this paper, we presentthe SigCache approach.In contrastto traditionalsignaturefileswheresignaturesarestoredin separatefiles,signaturesarein ourapproachstoredtogetherwith theirobjects.In addition,themostfrequentlyaccessedsignaturesarestoredin amainmemorysignature cache(SigCache).Whenusingthesignaturesstoredin theSigCacheasafilter duringperfectmatchqueries,thenumberof objectsthatactuallyhaveto beretrievedduringperfectmatchqueriescanbereduced.The increasein updatecostis insignificant,becausethesignaturesaremuchsmallerthantheobjects.

Key words: objectdatabasesystems,signaturefiles,buffer management,queryprocessing

1 Introduction

In many of theemerging applicationareasfor databasesystems,datais viewedasa collectionof objectsand the accesspatternis navigational.A typical exampleof suchanapplicationareais XML/Webstorage.XML datashouldpreferablybestoredin anobjectdatabasesystem(ODB) [12], but it canalsobe storedin a re-lational databasesystemor an object-relationaldatabasesystem.Another typicalcharacteristicof the new applicationsis that a large fraction of the accessesareperfectmatchaccessesononeor morewordsin text stringsin theobjects/tuplesinthesedatabases.For suchaccesses,signaturefiles

�canbeusedto reducethequery

cost.�E-mail:[email protected]�A signatureis abit string,whichis generatedby applyingsomehashfunctiononsomeor

all of theattributesof anobject.Notethatsignaturesarealsooftenusedin othercontexts,e.g.,functionsignaturesandimplementationsignatures.

Preprintsubmittedto DataandKnowledgeEngineering 14September2001

Page 2: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

Themaindrawbackof traditionalsignaturefiles is thatsignaturefile maintenancecan be relatively costly. If one of the attributescontributing to the signatureinan object (or a tuple) is modified, the signaturefile has to be updatedas well.To be beneficial,a high readto write ratio is necessary. This is alsothe casefordynamicsignaturefiles. In addition, high selectivity is neededat query time tomake it beneficialto readthesignaturefile in additionto theobjectsthemselves.

In this paper, we presenttheSigCacheapproach.Insteadof storingthesignaturesin separatesignaturefiles, thesignaturesarestoredtogetherwith theobjects,andthemostfrequentlyaccessedsignaturesarein additionstoredin a signaturecache(SigCache).A signatureis in generalmuchsmallerthananobject,sothatthenum-berof signatureswe cankeepin theSigCacheis muchhigherthanthenumberofobjectswe canstorein the main memorybuffer. Whenan object is updatedandthesignatureis storedon thesamepage,theextra insertcostis only marginal.Thesignaturescanalsobeusedto reducetheCPUcostwhentheobjectsarealreadyinmemory.

In this paper, we describein detail theuseandmaintenanceof theSigCache,andanalyzeits performanceby theuseof costfunctions.As for all accessmethods,thegaindependson accesspatterns.We show that the gain from usingthe SigCacheapproachis significantfor mostaccesspatterns.

Thediscussionhereis donein thecontext of ODBs,but theapproachis alsoappli-cablefor relationaldatabasesystemsandobject-relationaldatabasesystemsexperi-encinga navigationalaccesspattern.We assumea traditionalclient/serversystem,but it shouldalsobe notedthat this approachcouldprove to beevenmoreusefulin the context of mobile databases,wherereductionamountof datato be trans-ferredbetweenclient andserver hasevenmoreimpacton performancethanit hasin traditionalsystems.

The organizationof the rest of the paperis as follows. In Section2 we give anoverview of relatedwork. In Section3 we give an exampleof an applicationinwhichsignaturescouldbeemployedto improveperformance.In Section4 wegiveabrief introductionto signatures.In Section5 wedescribetheSigCacheapproach.In Section6 we develop a cost model,which we usein Section7 to study theperformancewhenaSigCacheis used.Finally, in Section8, weconcludethepaperandoutlineissuesfor furtherresearch.

2 Related work

Severalstudieshavebeendoneonusingsignaturesasatext accessmethod,e.g.[2,5,6,9].Lesshasbeendonein usingsignaturesin ordinaryqueryprocessing,but signaturefile techniqueshavebeenshownto bebeneficialin queriesonset-valuedobjects[7].

2

Page 3: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

<guide><restaurant><name>La Coupole</name><address>102 Bd du Montparnasse, 75015 Paris, France</address><metro>Vanves</metro><averageprice>189 FF</averageprice><review>

Good food, food value for money. Famous for be-ing one of the

favourite places for writers like Miller, Heming-way, Sartre, and

de Beauvoir.</review>

</restaurant></guide>

Fig. 1. Exampleof oneelementin therestaurantguide.

We have in a previous paperdescribedhow signaturescanbe usedin ODBs thatuselogicalOIDs[10]. In anODB usinglogicalOIDs,anOID index is usedto mapfrom logical OID to thephysicallocationof theobject,andthesignaturescanbestoredtogetherwith themappinginformationin this index. Theproblemwith thisapproach,is thattheOID index hasto beupdatedfor everyobjectupdate,makingitmostsuitablefor relatively staticdata,or for temporalODBswheretheOID indexis updatedoneveryobjectupdate.

Also relatedto thework in this paperis XML extensionsto commercialdatabasesystems(for exampleOracleandIBM DB2). Thesesystemscurrentlyusetext in-dexes(providedby text extenders)to speedupqueries.Therearealsosystemsstor-ing XML dataasobjects,for exampleTamino[11] andXyleme [1]. They shouldbeableto benefitfrom thetechniquesdescribedin this paper.

3 Motivating example

In this sectionwe describeanapplicationandanexampleof a typicalquerywherethe techniquespresentedin this papercanbe employed to improve performance.Theexampleis inspiredby therestaurantguideexamplein [4].

In mostbig cities,goodrestaurantsarepopular, andreservation is oftennecessaryin orderto geta table.For foreign touristsplanninga trip to the city, it would benice to have a centralbookingserviceavailableon Web,which could be usedtohelpselectingrestaurantsaswell asdoing thebooking(trying to do a reservationin a city in a differentcountrycanbe difficult, andlanguageproblemsmakesthejob evenworse).

3

Page 4: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

select Rfrom guide.restaurant Rwhere R.address.contains("Paris")

and R.review.contains("famous")and R.review.contains("good")and R.review.contains("food")

Fig. 2. Queryon restaurantguide.We assumecontains("word") is a functionthatissupportedby astringsclassandreturnstrueif theactualword is includedin thestring.

Let usassumea company thatwantto providesucha service.Thecompany mightwantto providerestaurantguidesasXML documentsontheWeb,onefor eachcity.TheXML documentsconsistsof restaurantelements,which eachincludesa nameelement,addresselement,price element,reviewer comment,and possiblyotherinformationaswell. An examplerestaurantelementis shown in Fig. 1. AlthoughpostedasXML documentsontheWeb,it mightjustaswell bestoredasacollectionof objectsin the database,and the XML documentgeneratedfrom this data.Inthis case,the guide could be implementedas one objects,being a collection ofrestaurantobjects.The restaurantobjectsthemselvescould includetext attributeslikename,address,etc.

In additionto providing thelist “as is” on theWeb,thesystemwould alsoprovidepossibilitiesfor usersto doqueriesto theguide,in orderto guidethemin selectingan appropriaterestaurant.This will be donein someeasy-to-useinterface,whichtranslatesthe queryto the internalquerylanguage.An exampleof suchan queryis “list restaurantsin Paris which are famousand have good food”. In a querylanguagethiscouldbeformulatedasshown in Fig. 2. Thepoint to notehereis thatall theaccessesin thequeryareperfectmatchaccesses(PMA), testingstringsformatchwith singlewords.

4 Signatures

In this sectionwe describehow to generatesignatures,how to usesignaturestoreducethecostof perfectmatchaccesses,andsignaturestoragealternatives.

4.1 Signaturegeneration

A signaturecanbe generatedby applyinga hashfunction on someor all of theattributesof the object.By applying this hashfunction, we get a signatureof �bits. If we denotetheattributesof anobject ��� as � �� � �������� � ��� , thesignatureoftheobjectis ��������������� ������� ���� , where��� is ahashvaluegeneratingfunction,and��� ������� ��� aresomeor all of theattributesof theobject(not necessarilyincluding

4

Page 5: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

all of ��� ������� ��� ).It is possibleto generatethesignaturefrom thehashvalueof theconcatenationofoneor moreof theattributes.However, suchsignaturescanonly beusedfor querieson thesamesetof attributesthatwereusedto generatethesignature.For this pur-pose,usingan index will in many caseshave a lower cost.In orderto be abletosupportseveralquerytypes,thatdo perfectmatchon differentsetsof attributes,atechniquecalledsuperimposedcodingcanbeused.In thiscase,aseparateattributesignatureis generatedfor eachattribute.Theobjectsignatureis generatedby per-forming a bitwiseOR on eachattributesignature.For example,for anobjectwith3 attributesthesignatureis ����� ���!�"��# � OR ������� � � OR ������� � � . This resultsin asignaturethat is very flexible in use.It cansupportseveral typesof queries,withdifferentattributes.

It is not necessaryto useall attributeswhencreatingthesuperimposedsignature.If we know thatonly $ of theattributeswill be frequentlyusedin PMAs, we cangeneratethesignaturefrom theseattributesonly. In thecaseof stringattributes,forexamplein anXML object,wewill generateaseparatesignaturefrom eachdistinctword in thestringattributes,andsuperimposethesewordsignatures.$ will in thiscasebethenumberof distinctwordsin theobject.

4.2 Usingsignatures

A typical exampleof the useof signatures,is a queryto find all objectsin a col-lection of objectswherethe attributesmatcha certainnumberof values,i.e., thequery predicateis % � �"���&� '�� ��������� ���(� ')�� . This can be doneby calcu-lating the querysignature��* of the query: ��*+�,���!�"���-�.'�� ������� � ���/�.'0��� (or��*1�2�����"'��3� OR ������'4�"� OR ����� OR ������')�� if superimposedcoding is used)Thequerysignature��* is thencomparedto all thesignatures��� in thesignaturefile tofind possiblematchingobjects.A possiblematchingobject,adrop, is anobjectthatsatisfiestheconditionthatall bit positionssetto 1 in thequerysignature,alsoareset to 1 in the object’s signature.The dropsforms a setof candidateobjects.Anobjectcanhave a matchingsignatureevenif it doesnot matchthevaluessearchedfor, soall candidateobjectshave to beretrievedandmatchedagainstthevaluesetthatis searchedfor. Thecandidateobjectsthatdonotmatcharecalledfalsedrops.

4.3 Signaturefiles

Traditionally, thesignatureshave beenstoredin oneor moresignaturefiles, sepa-ratefrom theobjects/tuples.Thefilescontain��� for all objects��� in therelevantset.Thesizesof thesefiles arein generalmuchsmallerthanthesizeof therelation/setof objectsthat thesignaturesaregeneratedfrom, anda scanof thesignaturefiles

5

Page 6: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

5 6 7 8:9 ; <>= ?5 6 7 8>9 ; <:= ?@BA C ? D ;

@BA C ? D ;5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8:9 ; <>= ?E 6 F GIH>9 7J?

@KA C ? D ;@KA C ? D ;

@BA C ? D ;5 6 7 8:9 ; <>= ?

5 6 7 8>9 ; <:= ?@BA C ? D ;@BA C ? D ;5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8:9 ; <>= ?E 6 F GIH>9 7J?

@KA C ? D ;@KA C ? D ;

@BA C ? D ;

5 6 7 8:9 ; <>= ?5 6 7 8>9 ; <:= ?@BA C ? D ;

@BA C ? D ;5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8:9 ; <>= ?E 6 F GIH>9 7J?

@KA C ? D ;@KA C ? D ;

@BA C ? D ;

H>9 7J?BL><:M M ? = 5 6 7 8:9 ; <>= ?5 6 7 8>9 ; <:= ?@BA C ? D ;

@BA C ? D ;5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8:9 ; <>= ?E 6 F GIH>9 7J?

@KA C ? D ;@KA C ? D ;

@BA C ? D ;

5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?5 6 7 8:9 ; <>= ?

5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?5 6 7 8>9 ; <:= ?

5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?5 6 7 8>9 ; <>= ?

5 6 7 8:9 ; <>= ?ONP9 D Q:?

R 9 6 8 R ? S�T = UFig. 3. Storageof objectpagesandsignaturesin mainmemory. Pagebuffer to theleft, andsignaturecache(SigCache)to theright.

is muchcheaperthana scanof the whole relation/setof objects.Two well-knowstoragestructuresfor signaturesareSequentialSignatureFiles(SSF)andBit-SlicedSignatureFiles (BSSF).

In the simplestsignaturefile organization,SSF, the signaturesarestoredsequen-tially in a file. A separatepointerfile is usedto provide themappingbetweensig-naturesandobjects.In anODB, thepointerfile will typically bea file with OIDs,onefor eachsignature.During eachsearchfor perfectmatch,thewholesignaturefile hasto beread.Updatescanbedoneby updatingonly oneentryin thefile.

With BSSF, eachbit of the signatureis storedin a separatefile. With a signaturesize � , thesignaturesaredistributedover � files, insteadof onefile asin theSSFapproach.This is especiallyusefulif wehavelargesignatures.In thiscase,weonlyhaveto searchthefilescorrespondingto thebit fieldswherethequerysignaturehasa “1”. Thiscanreducethesearchtimeconsiderably. However, eachupdateimpliesupdatingup to � files,which is very expensive.So,evenif retrieval costhasbeenshownto bemuchsmallerfor BSSF, theupdatecostismuchhigher, 100-1000timeshigheris not uncommon[7]. Thus,BSSFbasedapproachesaremostappropriatefor relatively staticdata.

To bettersupportinsertions,deletions,andupdates,severaldynamicsignaturefilemethodshave beenproposed.Thesearemultiway treevariantsandhashfile vari-ants.

5 The SigCache approach

An alternative to storethe signaturesin separatesignaturefiles is to storethemtogetherwith the objectson the objectpages,andin additionstorethe most fre-quentlyaccessedsignaturesin a signature cache (SigCache).This is illustratedin

6

Page 7: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

Fig. 3. The SigCacheis a lookup tablewherereplacementof signaturesis doneaccordingto an LRU policy. In this paperwe assumethata clock algorithmwithoneaccessbit for eachsignatureis usedfor this purpose.

A signatureis storedon thesamepageasits object.This implies that if anobjectis residentin main memory(i.e., the pagewherethe object residesis residentinthepagebuffer), its signaturewill alsoberesident,becauseit is storedin thesamepage.However, theoppositeis not true:asignaturecanberesidentin theSigCacheeventhoughits objectis not residentin thebuffer. Whena pageis discardedfromthepagebuffer, signaturesof someof theobjectson thepagecanberesidentin theSigCache.

Perfectmatchaccessescanusethesignaturesto reducethenumberof objectsthathave to be retrieved from disk. Only the candidateobjects,with matchingsigna-tures,needto be retrieved. In order to identify a signaturein the SigCacheeachsignatureneedto have an uniqueidentifier. In an ODB this will be the OID, in arelational-or object-relationaldatabasesystemthis canbea physicaltuple identi-fier or theconcatenationof relationandkey.

A signatureis in generalmuchsmallerthananobject,sothatthenumberof signa-tureswe cankeepin theSigCacheis muchhigherthanthenumberof objectswecanstorein the main memorybuffer. The fact that the effective spaceutilizationin an objectpagebuffer is low becauseof badclusteringon objectpagesfurtherincreasetheamountof relevantsignaturesrelative to relevantobjects.Theoptimalsizeof theSigCachedependson accesspatternandtotal buffer sizerelative to thetotaldatabasesize,but aSigCachesizein theorderof 5-10%of thetotalbuffer sizewill givearelatively stableperformance(thiswill bediscussedin moredetaillater)for awide rangeof workloadparameters.

Signaturesarenot maintainedfor all objectsin the system,only when it is ben-eficial. This can for examplebe decidedon the granularityof an objectclassorobjectcontainer(alsocalledfile). Evenwhensignaturesexist, they areonly usedin aqueryif it is consideredbeneficialwith respectto querycost.This is similar totraditionalsecondaryindexes,whereanindex is createdandmaintainedonly if it isconsideredbeneficial(this is decidedby thedatabaseadministrator),andtheindexis only usedin a queryif it is consideredbeneficial(this is decidedduring queryplanning).

Notealsothatevenif storingsignaturestogetherwith theobjectsreducesthespacefor objectson a page,this reductionis only marginal,astheoptimalsizeof signa-tureswhenusingtheSigCacheapproachis relatively small (typically in theorderof 3-6%).

7

Page 8: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

5.1 Theadvantagesof theSigCacheapproach

TheSigCacheapproachhasmany advantages.Themostimportantare:

V Traditionally, signatureshave only beenbeneficialfor relatively staticdata(i.e.,a low updaterate),for exampletext documents.Evenwhendynamicsignaturefilesareused,theupdateratemustbelow if theuseof signaturesshouldimproveperformance.Dynamicsignaturefiles alsohave higherspacerequirementsandsearchcostcomparedto SSFandBSSF.

In contrastto using separatesignaturefiles, the SigCacheapproachis alsouseful in the caseof high updaterates.A signatureis in generalmuchsmallerthan the object it is createdfrom, so that when an object is updatedand thesignatureis storedon the samepage,the extra insertcost is only marginal. Ifonly a moderateamountof main memoryis usedfor the SigCache,the pagebuffer hit rateis only marginally reduced.V Readaccessesin a relationaldatabasesystemarefrequentlysetaccesseswhichcanbenefitfrom traditionalsignaturefiles.In contrast,readaccessesin ODBsaremostlynavigational.Even in thecaseof collection(for examplea set)queries,navigationwill oftenbetheresult.Unlikearelationaldatabasesystemwherethequeriedsetis a relationon storage,in anODB, a collectioncanbea collectionof referencesto objects(OIDs) ratherthantheobjectsthemselves(anobjectcanin this way belongto morethanonecollection).This navigation canmake theaveragesignatureretrieval costhigh if thesignaturesarestoredin separatefiles.If themostfrequentlyaccessedsignaturesarestoredin theSigCache,this costcanbesignificantlyreduced.V Previously, signatureshave mostlybeenusedto reducetheI/O-costs.However,signaturescanalsobe usedto reducethe CPU costs.Even if an object is resi-dentin mainmemory, a signaturecomparisoncanbeusedbeforematchingtheattributes.Especiallyin thecaseof many or largeattributes,this canreducetheCPUcostfor aPMA.

5.2 QueryprocessingusingtheSigCache

WhenaPMA is doneononeormoreattributesof anobject,thefollowingalgorithmis usedto determineif an object ��� is a match,andat thesametime maintainingthecontentsof theSigCache:

(1) A lookup is donefor theobject’s signature��� in theSigCache.If successful,theaccessbit of thesignaturein theSigCacheis set.

If this lookup is not successful,the signaturehasto be retrieved from thepagewheretheobjectis stored.This pagemayberesidentin thepagebuffer,but if it is not, thepagehasto beretrievedfrom disk.Whenthepageis found,

8

Page 9: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

thesignature��� which is storedonthepage,is insertedinto theSigCache.Theaccessbit for thesignatureis setwhenthesignatureis inserted.

(2) Thesignature��� is comparedto thequerysignature��* . If not all bit positionssetto 1 in ��* aresetto 1 in ��� , weknow for surethattheobjectdoesnotmatchthequerypredicate.If all bit positionssetto 1 in ��* aresetto 1 in ��� theobjectis apossiblematch,andwehave to retrievetheobjectin orderto comparethevalueof its attributeswith thequerypredicate.Thepagewheretheobjectisstoredon might alreadybe in the pagebuffer (becauseit hasbeenaccessedrecently, or hasbeenbroughtinto thebuffer in orderto retrieve thesignatureof oneof the objectson the page),if not, the pagehasto be retrieved fromdisk.

A queryon a collectionof objectsis donein thesameway, oncefor eachobject.Also notethatby employing signaturesasoutlinedin thealgorithmabove,theCPUcostcanalsobereduced,becausethesignaturesarecomparedbeforetheobjectiscomparedwith thesearchpredicate.

5.3 Scanoperations

Onesinglescanoperationcanmake all the signaturesstoredin the SigCachetobereplaced.This is oftennot desired.Onereasonfor this, is thatthesignaturesre-trievedduringascanoperationwill in generalhavelesschanceof beingusedagain,it is not likely thatthewholecollectionor containerto bescannedrepresentsahotset.Evenif this is thecase,it is possiblethatthenumberof signaturesretrieveddur-ing thescanis largerthanthenumberof signaturesthatfits in theSigCache.In thiscase,even if we shortlyafterdo a new scanover thecollection/container, we willhave a SigCachehit probabilityof 0. This is similar to generalbuffer managementin thecaseof scanoperations.

We have severalpossiblestrategiesto usein orderto avoid theproblemwith scanoperationsinvolving a large numberof objects.Onestrategy is to not insertsig-naturesretrievedin a scanoperationinto theSigCache(but thesignaturesalreadystoredin theSigCachecanbeusedwhenappropriate).If this strategy is used,weavoid theSigCachepollution,but at thesametimewelooseopportunitiesto benefitfrom thesignatures,asmany of thesescanoperationswill involvea smallnumberof objects.A variantof this strategy is to insert signaturesinto the SigCacheifthe scanoperationinvolvesa small or moderatenumberof objects,but not insertsignaturesif thescaninvolvesa largenumberof objects.

It is oftendifficult to know thenumberof objectsinvolvedin ascan.It will probablybemoresafeto simplylimit thenumberof signaturesfrom objectsof acertainclassto bestoredin theSigCache.For example,only 25%of theslotsin theSigCachecanbeoccupiedby oneclass.This canbeachievedby maintainingthenumberof

9

Page 10: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

signaturesof objectsfrom eachobjectclassin theSigCache.In many pageserversthe objectclassis not known to the server. In this case,the limitation canbe thenumberof objectsfrom a particularcontainer/file(the container/fileidentifier isencodedinto theOID in mostODBs).

5.4 ObjectupdatesandSigCachemaintenance

Every time an object is modified,its signaturehasto be modifiedaswell. A sig-natureis storedtogetherwith its object,andwith respectto concurrency control,loggingandrecovery, the signatureis treatedasa part of the object.If thesigna-tureof theobjectis residentin theSigCache,thesignaturein theSigCachehastobe updatedaswell. This implies that a SigCachelookup hasto be donefor eachobjectupdate.However, comparedto thenumberof CPUinstructionsnecessarytoprovidepersistentstorageof anobject,this lookupcostis only marginal.

Only signaturesthat can contribute in readqueriesare beneficialto keepin theSigCache,sothatwhenasignatureis updatedin theSigCache,theaccessbit is notset.Readaccessesarenecessaryto makeasignaturestayin theSigCache.

5.5 Redundantsignaturesin mainmemory

A signatureisstoredin thesamepageasits object,sothatif theobjectof asignaturein the SigCacheis residentin main memory, the signatureis actuallystoredtwoplaces.In orderto optimizememoryusage,it is possibleto only storesignaturesof objectsthatarenot residentin mainmemoryin theSigCache.This canbedoneby removing all signaturesof objectsin apagefrom theSigCachewhentheobjectpageis retrievedinto thepagebuffer. However, this is notagoodstrategy:

(1) Removing andreinsertingall signaturesfrom a pagesignificantly increasestheCPUcost.

(2) Whena pageis to be discardedfrom the pagebuffer, it is difficult to knowwhichsignaturesshouldbereinsertedinto theSigCache(it is difficult to main-tain theLRU policy). Theonly easysolutionis to insertall thesignaturesofthe objectson the pageinto the SigCache.However, in generalonly a fewof thesignatureson a pageare“hot spotsignatures”,so this will pollute theSigCache.The result will be a seriousreductionin the SigCachehit ratio,effectively killing all benefitsfrom this memory“optimization”.

Thesecondproblemin particularis veryserious,sowedonotconsiderthisstrategyany further.

10

Page 11: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

5.6 Signaturesandobject-orientation

An object is not necessarilyjust a collectionof simplevalueattributesasa tuplein a relationaldatabasesystem.An object can containmethods(not consideredhere)andreferencesto otherobjects,andtheobjectsin aqueriedcollectioncanbeobjectsof differentclasses,dueto theconceptof inheritance.Wenow describehowtheseissuescanbehandledwith respectto signatures.

5.6.1 Complex objectsandpathexpressions

Whenanobjectsignatureis generated,its attributesaresimplytreatedasbit strings.If anattribute is anOID (a referenceto anotherobject),theOID is simply hashedlikeany otherattribute.

Althoughnotconsideredin thispaper, avalueonapathexpressioncouldbeusedinthesignaturegenerationprocess,similar to pathexpressionindexing.For example,it could be possibleto definethat insteadof simply hashinga referenceattributeobjPtr, thevalueof objPtr W�X objPtr2 W�X attrib shouldbehashedin-stead.This would significantlyincreasethe costandcomplexity of signaturecre-ation,becauseevery timeattrib is updated,all signaturesbasedon thevalueofattrib have to beupdatedaswell. However, if querieson pathexpressionsarecommon,thisapproachcouldstill bebeneficial.

5.6.2 Inheritance

Signaturesfor objectsof two objectclassesA andB arecompatible, i.e., canbecompared,if 1) theattributesusedfor creatingsignaturesfor objectsof oneof theclassesarea subsetof theattributesusedfor creatingsignaturesfor objectsof theotherclass,2) thesignaturesizeis thesame,and3) thesamehashfunctionis used.Thiscanbethecasefor signaturesof asuperclassandoneof its subclasses.

Frequently, we want a larger signaturesize for a subclasswhen the numberofattributesis higher. If this is thecase,andwe do a queryon a collectionof objectsfrom the superclassaswell as the subclass,a separatequerysignaturehasto begeneratedfor eachsubclass(but note that only one signaturefor eachobject isgeneratedandstored).Whencomparingthe signatureof an object with a querysignature,the appropriatequery signatureis used,basedon the classthe objectbelongsto.

Example:AssumewehaveaclassA, aclassB thatinheritsfromclassA, andthatweusea differentsignaturesizesfor classA andB. If we do a queryovera collectionof A objects,this collectioncanalsocontainB objects,becausea B objectis alsoanA object.Whenthequeryis executed,two querysignaturesaregenerated,one

11

Page 12: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

querysignature�Y* for objectsof classA, andonequerysignature�Z* for objectsofclassB.

5.7 Signature recalculation

It is not strictly necessaryto storethesignaturesof theobjectson disk.Thesigna-turescanberecalculatedwhentheobjectsareretrieved.This savesdisk spaceandincreasesthepagebuffer hit rate(becausethetotal numberof objectpagesis less),but increasestheCPUcost.CPUspeedis increasingatamuchhigherratethandiskperformance,soeven if signaturerecalculationinsteadof signaturestorageis notbeneficialtoday, it is likely that this alternative will becomeattractive in thenearfuture.In themainpartof theanalyticalstudyin thispaperwewill assumethatthesignaturesarestoredondisk,but wewill alsostudyhow muchtheI/O costscanbereducedby usingrecalculation(Section7.5).

5.8 Pagevs.dualbuffering

An alternativeto finegranularitybufferingof signatures, is to combinepagebuffer-ing with finegranularitybufferingof theobjectsinstead.This is calleddualbuffer-ing. If dualbuffering is used,we try to keepwell clusteredpagesin a pagebuffer,andobjectsfrom lessclusteredpagesin anobjectbuffer.

A thoroughstudyof clientsidedualbufferingbyKemperandKossmann[8] showedthatdualbuffering cangive a substantiallyhigherbuffer performancethana pagebuffer. However, theuseof adualbuffer introducesseveralnew optionsthatmakestuningmorecomplicated,for examplewhento copy anobjectfrom thepagebufferto theobjectbuffer, andwhento copy a dirty objectin theobjectbuffer backto itshomepage.Thismakesit lessclearhow well dualbufferingwouldperformin sys-temswith complex workloads.Thestudyshowedthatdualbufferingwasmostben-eficial for a workloadwith readqueries. With updatequeriesthegainwaslessornegative.Thus,weonly considertheuseof pagebufferingin thispaper. However, itshouldbenotedthatfor workloadswheredualbuffering is beneficial,performancecanstill beimprovedby additionalfine granularitybufferingof signatures.

6 Analytical model

In order to study the gain from using signatures,we develop a cost model thatmodelstheobjectaccesscosts.Thecostmodelfocusesondiskaccesscosts,asthisis the mostsignificantcost factor. We do not considerthe additionalupdatecost

12

Page 13: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

...β0 β

...[�\ ] [�\ ] [�\ ]^[�\ ][�\ ]1

Fig. 4. Partitioneddataarea.Eachpartition containsa fraction _ � of the datagranules,and ` � of theaccessesaredoneto eachpartition.

causedby storingthe signatures,becausethe sizeof a signaturecomparedto anobject is small (between3% and6% in the study in this paper).The purposeofthis paperis to analyzethe effect of usingsignaturesin an ODB, so we focusonnavigationalaccessesandrestrictthis analysisto PMAs andsignaturesgeneratedby theuseof thesuperimposedcodingtechnique.

Thedatabasesystemmodeledin this paperis a pageserver ODB. TheODB hasatotalof a bytesavailablefor buffering.Thus,whenwetalk aboutthememorysizea , weonly considerthepartof mainmemoryusedfor buffering.Themostrecentlyusedobjectpagesarekeptin apagebuffer of size a pbuf, andthemostrecentlyusedsignaturesarekept in a SigCacheof size a scache. Themainmemorysize a is thesumof thesizeof thepagebuffer andtheSigCache,a � a pbuf b a scache.

6.1 Buffer hit rates

It is importantto have an accuratebuffer hit model.For this purpose,we usetheBhide, Dan andDias LRU buffer model (BDD) [3]. An importantfeatureof theBDD model,which makesit morepowerful thansomeothermodels,is thatit alsocanbeusedwith non-uniformaccessdistributions.

A databasein the BDD model hasa size of c datagranules(e.g.,pages),par-titioned into d partitions.As illustratedin Fig. 4, eachpartition containsa frac-tion ef� of the datagranules,and gh� of the accessesare doneto eachpartition.The distributions within eachof the partitionsare assumedto be uniform, andall accessesare assumedto be independent.We denotean accesspattern(parti-tioning set) ij�.��gk# ��������� gml�n �� eo# �������3� epl�n � � . For example,for the80/20model,i 2P8020 � �rq �tsu� q �wvx� q �yvx� q �ws � . In this paper, we will studyperformancewith fourdifferentaccesspatterns:

Set eo# e � e � gh# g � g �2P7030 0.30 0.70 - 0.70 0.30 -

2P8020 0.20 0.80 - 0.80 0.20 -

2P9010 0.10 0.90 - 0.90 0.10 -

2P9505 0.05 0.95 - 0.95 0.05 -

13

Page 14: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

6.1.1 TheBDD buffer model

We will briefly explain the main equationsin the BDD model,thederivationanddetailsbehindtheequationscanbefoundin [3].

After z accessesto thedatabase,thenumberof distinctdatagranules(pages,ob-jects,or index entries)from partition { thathavebeenaccessedis:

c �| �P}�~P�y���"~ ��z � c � i�����ef��c�����W����^W �ef��c ���� � �

Thetotal numberof distinctdatagranulesaccessedis:

c | �B}�~P�w����~���z � c � i���� l� �P� � c �| �B}"~P�y����~ ��z � c � i��

Whenthe numberof accessesz is suchthat the numberof distinct datagranulesaccessedis lessthanthebuffer sizeB (thenumberof datagranulesthatfits in thebuffer), � l �B� �u� ���"z���� � , thebuffer hit probabilityfor partition { is:

� ���"z � i������^W����^W �ef�"c ���� �

Theoverallbuffer hit probabilityis:

� ��z � i���� l� �B� � g��� ����z � i��

Thesteadystateaveragebuffer hit probabilitycanbeapproximatedto thebuffer hitratio whenthebuffer becomesfull, i.e., z is chosenasthelargestz thatsatisfies:

l��B� � c �| �B}"~P�y����~ �"z � c � i���� �

Wedenotetheaveragebuffer hit probabilityas:

�buf � � � c � i���� � ��z � i��

wherez is chosenasdescribedabove.

14

Page 15: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

6.1.2 Pagebuffer hit rate

Thenumberof diskpagesnecessaryto storeadatabasewith c����B� objects,assumingpagesize ��� andanaverageobjectsizeof ���O�B� bytes:

c����B�rl������}�� c����B�����O�B����If we store � bits signaturesof all objectson theobjectpagesaswell, thenumberof diskpagesis:

c��O�B�rl������}�� c��O�B�0�����O�B� b � ��¡ s0¢ ����With a pagebuffer sizeof a-l�¤£3¥ bytesandanoverheadfor eachitem in thebufferof ����� bytes,we cankeep c pbuf � ¦¨§ª©K«­¬®�¯!°o®±�² of thesepagesin main memory. Thebuffer hit ratein thiscaseis:

� �¤£3¥ l������� � �¤£3¥��rc pbuf� c��O�B��l �����O} � i��

6.1.3 SigCachehit rate

For eachsignaturein theSigCache,weneedto storeobjectidentifying informationin orderto know which objectit belongsto. It is not necessaryto storethewholeOID for eachsignature,variantsof prefix compressionor multi level accesstablescanbeemployed.Assuming��³�´ bytesin averageareneededto know whichobjectasignaturebelongsto, thenumberof signaturesthatfits in theSigCacheis:

c scache � a�}r�"������� ��¡ s0¢ b ����� b ��³�´

TheSigCachehit rateis:

�scache � � �¤£3¥!�rc�}r�"������ � c��O�B� � i��

6.2 Objectaccesscost

Oneor moreobjectsarestoredon eachdisk page.If we denotethetime to readorwrite apageas µ�� , theaveragecostof readingonepagefrom thedatabaseis:

µm¶ª�r� | l�������·���^W � buf page �ªµ��15

Page 16: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

In orderto reducetheobjectaccesscost,objectsareusuallyplacedon disk pagesin a way that makes it likely that morethanoneof the objectson a pagethat isread,will beneededin thenearfuture.This is calledclustering.In our model,wedefinetheclusteringfactor ¸ asthefractionof anobjectpagethat is relevant,i.e.,if thereare c�� l�����¹�ºc��O�B�3¡�c����B�rl������} objectson eachpage,and z of theobjectsonthepagewill beusedbeforethepageis discardedfrom thebuffer, ¸»� �¼ ± §�½O¾r¿ . Ifc�� l �����ÁÀ»� � q , i.e., theaverageobjectsizeis larger thanonedisk page,we define¸º��� � q . Theaverageobjectaccesscostis:

µ ���ª}��w�¶ª�r� | �O�B� � �¸�c�� l����� µm¶ª��� | l �����

Wemodelthedatabasereadaccessesas1) ordinaryobjectaccesses, and2) perfectmatch accesses, which canbenefitfrom signatures.We assumetheperfectmatchaccessesto be a fraction

�PMA of the readaccesses,andthat

� Y is the fraction ofmatchedobjectsthatareactualdrops.Thefalsedropprobabilitywhena signaturewith � bitsis generatedfrom $ attributes,is denoted� | �Â� �� �ªÃ � whereÄÅ�»Æ�ÇJÈ �´ .Theaverageobjectaccesscostemploying signaturesis:

µ }��y�¶O��� | ���B� = Costof non-PMAaccess+Costof PMA accesswheresignaturenot in SigCache+Costof PMA accesswheresignaturein SigCache

= ���^W � PMA��µ ���ª}"�y�¶ª��� | �O�B�b � PMA ���^W � scache��µ ���ª}��y�¶ª�r� | �O�B�b � PMA�

scache � � Y µ ���O}��y�¶ª��� | �O�B�b ���^W � Y ��� | µ ���ª}"�y�¶ª��� | �O�B� �This meansthat of the

�PMA accessesthat arefor perfectmatch,we only needto

readtheobjectpagein thecaseof actualor falsedrops.

7 Performance

We have now derived thecostfunctionsnecessaryto calculatethe averageobjectaccesscost,with different systemparametersand accesspatterns,and with andwithout theuseof signatures.We will in this sectionstudyhow differentvaluesoftheseparametersaffect theaccesscosts.

16

Page 17: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

Parameter ValueÉ � 8 KBÊ � 7 msÉ �O� 8 bytesÉ ³�´ 4 bytesTable1Default systemmodelparameters.

Parameter CaseI CaseIIÉ �O�B� 128bytes 2048bytesË0.2 0.8Ì

obj 16 mill. 1 mill.Í4 attributes 128attributesÎ Y 0.001 0.001Î

PMA 0.4 0.8Ï2P9010 2P9010Ð32 bits 1024bits

Table2Defaultworkloadparameters.

Workload model In this analysis,we considera databasein a stablecondition,with a total of c obj objects.The systemmodelparametersaresummarizedin Ta-ble 1. For theworkload,weconsidertwo maincases:

Case I: This is theworkloadof a traditionalapplication.Theobjectsizeis rela-tively small,andthesignaturesaregeneratedfrom asmallnumberof attributes.Insuchapplications,therewill bea mix of accesstypes,andonly someof themcanbenefitfrom usingsignatures.

Case II: This is theworkloadof oneof theemergingapplicationareas,for exam-pleXML storage.As describedby DeWitt etal. [12], thespaceoverheadwouldbeveryhigh if eachXML elementwasstoredasaseparateobject.Instead,oneobjectis usedto storeoneXML document,andtheXML elementsstoredas“light-weightobjects”insideonestorageobject.Theresultis largerobjectsthancaseI, andeachobject containsa higher numberof attributes.The attributesare frequently textstrings,anda largefractionof thequeriesin suchsystemswill befor perfectmatch

17

Page 18: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

1e-05

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Cos

t in

Sec

ondsÑ

Memory Size (M /SDB)

F :

8163264

1e-05

0.0001

0.001

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Cos

t in

Sec

ondsÑ

Memory Size (M /SDB)

F :

51210242048

Fig. 5. Costwhenusingdifferentsignaturesizes.CaseI to theleft, andcaseII to theright.

of oneor morewords.In orderto beableto usesignaturesfor suchqueries,theob-jectsignaturesaregeneratedby superimposingtheindividual text wordsignatures.As a result,thevalueof $ will belarge.

The default parametervaluesfor the two casesare given in Table 2. Note thatwith the default parameters,we keepthe total databasesizeconstant,the studieddatabasehasa sizeof 2 GB for both caseI and II. This is not a large database,but the interestingpoint is performanceversustherelative memorysize(memorysizerelative to databasesize).The resultsscalefor larger databasesizes,so thatgivena memorysizeof a fraction Ò of thetotal databasesize,thegainfrom usingsignaturesis thesameindependentof thedatabasesize. In all figures,thememorysize is given as memorysize relative to the databasesize, i.e. as ¦¼ ± ©wÓ ® ± ©tÓ (notethatwhensignaturesareused,thememoryneededto keepall pagesin memoryislarger).

Also note that althoughthe object accesscost dependson the clusteringfactor¸ , the gainof usingsignaturesis independentof ¸ , i.e., the gain is the samefordifferentvaluesof ¸ .

7.1 Optimalsignaturesize

Beforewestudytheperformancegainwhenusingsignatures,wehaveto determineappropriatesignaturesizes.Thesignaturesizeis a tradeoff. Usinglargesignaturesreducesthefalsedropprobability, but largesignaturesalsoreducesthenumberofsignaturesthat canbe kept in the SigCache.Too large signatureswill alsobreaktheassumptionthatsignaturesaremuchsmallerthantheobjects,andin thatwayincreasethesignaturemaintenancecost.

Fig. 5 illustratestheaccesscostswith differentsignaturesizes.Note that in orderto emphasizetheinterestingareasonthegraphs,thecostscaleis logarithmicandiscut.For caseI, weseethat � ��Ô v is asuitablesignaturesize(for mostsystemsthiswill alsobethesmallestreasonablesignaturesize,becausetheCPUwordsizeis 32bits or larger).Thesignaturesizedependsheavily on $ , thenumberof attributes

18

Page 19: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

0

1e-05

2e-05

3e-05

4e-05

5e-05

6e-05

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Cos

t in

Sec

ondsÑ

Memory Size (M /SDB)

Sscache :

0.01 M0.02 M0.05 M0.1 M0.2 M0.3 M0.4 M

0

2e-05

4e-05

6e-05

8e-05

0.0001

0.00012

0.00014

0.00016

0.00018

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Cos

t in

Sec

ondsÑ

Memory Size (M /SDB)

Sscache :

0.01 M0.02 M0.05 M

0.1 M0.2 M0.3 M0.4 M

Fig. 6.Costwith differentamountsof memoryreservedfor signaturecaching.CaseI to theleft, andcaseII to theright.

contributing to thesignature.With anincreasingvalueof $ , theoptimalsignaturesize increasesaswell. This is clearly evident for caseII, where � � �q v0Õ is asuitablesignaturesize.

Whendecidingthesignaturesizeit is alsoimportantto keepin mind that thesig-naturesizecannot easilybechangedafterwardsif it is too small.If this shouldbedone,reorganizingthedatabaseis necessarybecausethereis not reservedspaceforlargersignatureson theobjectpages.

7.2 OptimalSigCachesize

Thefractionof memorythatshouldbeusedfor cachingsignaturesdependson thetotal memorysize,databasesize,andworkload.The SigCachesizecaneitherbestatic (but tunable),or it can be adaptive (using cost functionsto determinethesize). It is often difficult to know the exact accesspattern.In order to evaluatewhethersignaturesareusefulin practice,wheretheSigCachesizecanbefar fromoptimal,wehavestudiedhow muchperformanceis affectedby differentSigCachesizes.This is illustratedin Fig. 6, which shows theaverageobjectaccesscostwithdifferentSigCachesizes.

Case I: Theoptimal fractionof memoryusedfor cachingsignaturesis betweenq � qpÖ and q � � , dependingon thetotalmemorysize.However, evenif a largersizeischosen,theperformanceis still acceptable.Thedifferencein costis in thiscasenotlarge.It shouldalsobenotedthatkeepingall signaturesin theSigCachein thiscaserequires256MB, soa SigCachelargerthanthis sizeis of no use.Also, its is mostimportantto keepthemostfrequentlyaccessessignaturesin theSigCache.Whenthe SigCacheis large enoughto keepthem,increasingthe SigCachesize furtherhaslessimpacton performance.

19

Page 20: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

-100

-80

-60

-40

-20

0

20

40

60

80

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

2P70302P80202P90102P9505

-100

-50

0

50

100

150

200

250

300

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

2P70302P80202P90102P9505

Fig.7.Gainfrom usingsignatureswith differentaccesspatterns.CaseI to theleft, andcaseII to theright.

Case II: In thiscase,theoptimalfractionof memoryusedfor cachingsignaturesis in theorderof q �wv , andis larger thanfor caseI. Thereasonfor this is thelargersignaturesizeusedfor caseII. GivenacertainSigCachesize,theactualnumberofsignaturesthatfits in theSigCacheis smaller.

As canbeeseenfrom thefigure,theperformancewill alsobeacceptablewith afixedSigCachesize.Evenif usingasizethatis notoptimal,goodperformancecanstill beachieved.For example,keepingtheSigCachesizeconstantat q � �4a givesgoodperformancefor awide rangeof memorysizesandaccesspatterns.

7.3 Gain fromusingsignatures

Fig. 7 shows the gain from usingsignatures,with differentaccesspatterns.Thegainfrom usingsignaturesis calculatedas:

Gain �·�q×q�Ø µ ���O}��y�readobj WÙµ }"�y�readobjµ }"�y�readobj

Ú

Case I: We seethat usingsignaturesis especiallybeneficialfor accesspatternswith anarrow hotspotarea.Therearetwo caseswhensignaturesarenotbeneficial:

(1) Whenthe memorysize is large enoughto keepmostof the hot spotobjectpages.In thiscase,it is morebeneficialto useall thememoryfor pagebuffer-ing, so that the hot spot objectscan be kept in main memory. It shouldbenotedthatwhenthememorysizeis smaller, sothatonly someof thesepagesfits in thepagebuffer, usingsignaturesis beneficial.

(2) The numberof object pagesnecessaryto storea databasewill be larger ifsignaturesarestoredaswell. If themainmemoryis largeenoughto keepmostof theobjectpagesin mainmemory(largevaluesof a ) whensignaturesare

20

Page 21: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

-100

-50

0

50

100

150

200

250

300

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

0.010.10.20.30.40.60.8

-100

-50

0

50

100

150

200

250

300

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

0.010.10.20.30.40.60.8

Fig.8.Gainfrom usingsignatureswith differentvaluesofÎ

PMA. CaseI to theleft, andcaseII to theright.

-80

-60

-40

-20

0

20

40

60

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

0.0010.01

0.10.5

-100

-50

0

50

100

150

200

250

300

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

0.0010.01

0.10.5

Fig. 9. Gainfrom usingsignatureswith differentvaluesofÎ Y . CaseI to theleft, andcase

II to theright.

notused,usingsignatureswill resultin decreasedor negativegainbecauseofa lowerpagebuffer hit rate.

Case II: In this case,usingsignaturesis very beneficialfor all accesspatterns,exceptwhenmostof theobjectpagesfits in mainmemory, asexplainedabove.Thegainis high,upto 300%whenwehaveanarrow hotspotareaandmediumamountsof memory.

7.4 Theeffectof� Y and

�PMA

Thevalueof�

PMA is thefractionof thereadaccessesthatcanbenefitfrom signa-tures.Fig. 8 illustratesthegainwith differentvaluesof

�PMA. As canbeexpected,

asmallvalueof�

PMA resultsin smallor negativegain.As we increasethevalueof�PMA, thegain increases.With largevaluesof

�PMA, aswe have assumedfor case

II, wecanachieveahighgain.Interestingto noteis thatwhenusingthesamevalueof�

PMA, thegainfor caseI andcaseII is almostthesame.

The value of� Y is the fraction of perfectmatchaccessesthat are actualdrops

(selectivity). Thepurposeof signaturesis to reducethenumberof objectsthatac-

21

Page 22: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

-80

-60

-40

-20

0

20

40

60

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

Stored SignatureRecalc

-50

0

50

100

150

200

250

300

350

0 0.2 0.4 0.6 0.8 1

Gai

n in

%

Memory Size (M /SDB)

Stored SignatureRecalc

Fig.10.Gainfrom usingsignatureswith recalculatedsignaturesandstoredsignatures.CaseI to theleft, andcaseII to theright.

tually have to beretrieved,andwe canonly benefitfrom signaturesif this numberis sufficiently low. Fig. 9 illustratesthegainwith differentvaluesof

� Y , andit is in-terestingto notethatsignatureswill bebeneficialevenwith a relatively largevalueof� Y .

7.5 Theeffectof recalculation

In theanalysissofar, wehaveassumedthatthesignaturesarestoredondisk.How-ever, asdescribedin Section5.7, it is alsopossibleto recalculatethe signatureswhenobjectpagesareretrieved.With thecurrenthardware,this will probablybetoo costly, but it is possiblethat it canbe more interestingin the future. Fig. 10illustratesthedifferencein gainbetweenrecalculatedsignaturesandstoringsigna-tures.The differenceis large enoughto make recalculationa possiblestrategy inthefuturewhentheCPUcostrelative to I/O costis evensmallerthantoday.

8 Conclusions

Wehavedescribedhow objectsignaturescanbecachedin mainmemoryin asigna-turecache,andhow thesignaturescanbeusedto reducetheaverageobjectaccesscostin a databasesystem.We developeda costmodelthatwe usedto analyzetheperformanceof theproposedapproaches,andthis analysisshowedthatsubstantialgaincanbeachieved.

Theexampleanalyzedin this paperis quitesimple.Perfectmatchaccesseshave alow complexity, andthereis only limited roomfor improvement.Therealbenefitis available in querieswherethe signaturescanbe usedto reducethe amountofdatato beprocessedat subsequentstagesof thequery, resultingin largeramountsof datathatcanbeprocessedin mainmemory. Thiscanspeedupqueryprocessingseveralordersof magnitude.

22

Page 23: Fine-GranularitySignature Caching in Object Database Systems · stored in an object database system (ODB) [12], but it can also be stored in a re-lational database system or an object-relationaldatabase

Furtherwork includesastudyonhow informationaboutrecentuseof signaturesofobjectsin a collectioncanbeemployedaspartof thequeryplanning.Recentuseof the signaturesimplies a higherSigCachehit ratethanwould otherwisebe ex-pected,thisknowledgecanbeusedto enhancecostmodelsusedin queryplanning.Otherfurtherwork includea studyof usingthecachedsignaturesduringjoin pro-cessing,which alsoinvolvedperfectmatchaccesses.Join is a frequentandcostlyoperation,andthegainfrom usingsignaturesandsignaturecachingshouldincreaseperformanceconsiderably.

References

[1] V. Aguilera,S.Cluet,P. Veltri, D. Vodislav, andF. Wattez.QueryingXML documentsin Xyleme. TechnicalReport182,Verso/INRIA,2000.

[2] E. Bertino andF. Marinaro. An evaluationof text accessmethods. In Proceedingsof the Twenty-SecondAnnualHawaii InternationalConferenceon SystemSciences,1989.Vol.II: Software Track, 1989.

[3] A. K. Bhide,A. Dan,andD. M. Dias. A simpleanalysisof theLRU buffer policy andits relationshipto buffer warm-uptransient.In Proceedingsof theNinth InternationalConferenceonData Engineering, 1993.

[4] S.S.Chawathe,S.Abiteboul,andJ.Widom. Managinghistoricalsemistructureddata.TAPOS, 5(3),1999.

[5] C. Faloutsos.Accessmethodsfor text. ACM ComputerSurveys, 17(1),1985.

[6] C. FaloutsosandR. Chan. Fast text accessmethodsfor optical and large magneticdisks: Designsand performancecomparison. In Proceedingsof the 14th VLDBConference, 1988.

[7] Y. Ishikawa, H. Kitagawa, andN. Ohbo. Evaluationof signaturefiles assetaccessfacilities in OODBs. In Proceedingsof the 1993 ACM SIGMOD InternationalConferenceonManagementof Data, 1993.

[8] A. Kemper and D. Kossmann. Dual-buffering strategies in object bases. InProceedingsof the20thVLDBConference, 1994.

[9] D. L. Lee,Y. M. Kim, andG. Patel. Efficient signaturefile methodsfor text retrieval.IEEE Transactionson Knowledge andDataEngineering, 7(3),1995.

[10] K. Nørvag. Efficient use of signaturesin object-orienteddatabasesystems. InProceedingsof Advancesin DatabasesandInformationSystems,ADBIS’99, 1999.

[11] H. SchoningandJ.Wasch.Tamino– aninternetdatabasesystem.In ProceedingsofEDBT’2000, 2000.

[12] F. Tian,D. J.DeWitt, J.Chen,andC. Zhang.Thedesignandperformanceevaluationof alternative XML storagestrategies(submittedfor publication),2000.

23