a case study of improving memory locality in polygonal

21
A Case Study of Improving Memory Locality In Polygonal Model Simplification: Metrics and Performance Victor Salamon Paul Lu Ben Watson Dima Brodsky Dave Gomboc Dept. of Computing Science University of Alberta Edmonton, Alberta, Canada salamon,paullu,dave @cs.ualberta.ca Dept. of Computer Science Northwestern University Evanston, IL, USA [email protected] Dept. of Computer Science University of British Columbia Vancouver, BC, Canada [email protected] Abstract Trade-offs between quality and performance are important issues in real-time computer graphics and visualization. Polygonal model simplification algorithms take a full-sized polygonal model as input and output a less-detailed version of the model with fewer polygons. For real-time display, fewer polygons result in faster rendering. For other applications, the speed of rendering may be traded-off for improved image quality. Even though the simplified models may be pre-computed, some full-sized models are large enough to require hours to simplify, due to paging. When computing multiple versions of the same model, with varying levels of detail, the performance of the simplification process becomes critical. R-Simp is a recent simplification algorithm that offers low run times, easy control of the output model size, and good model quality. However, when the internal data structures for the input model are larger than main memory, many simplifications algorithms, including R-Simp, suffer from poor performance due to paging. We present a case study of the R-Simp algorithm and how its data locality and performance can be substantially improved through an off-line spatial sort and an on-line reorganization of its internal data structures. We empirically characterize the data access pattern of R-Simp on three large models and present an application-specific metric, called cluster pagespan, of R-Simp’s locality of memory reference. We also experimentally show that both spatial sorting and reorganization can independently improve R-Simp’s performance by a factor of 2 to 6-fold. When both techniques are used, R-Simp’s performance improves by up to 7-fold. Keywords: graphics, visualization, performance, polygonal model simplification, memory locality, pag- ing, data structure reorganization, spatial sorting Length: 15 pages of text and 6 pages of figures/tables 1

Upload: others

Post on 26-Dec-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Case Study of Improving Memory Locality In Polygonal

A CaseStudy of Impr oving Memory Locality In Polygonal

Model Simplification: Metrics and Performance

Victor Salamon�

PaulLu�

BenWatson�

Dima Brodsky�

DaveGomboc�

�Dept. of Computing Science

University of Alberta

Edmonton, Alberta, Canada�salamon,paullu,dave�

@cs.ualberta.ca

�Dept. of Computer Science

Northwestern University

Evanston, IL, USA

[email protected]

�Dept. of Computer Science

University of British Columbia

Vancouver, BC, Canada

[email protected]

Abstract

Trade-offsbetweenqualityandperformanceareimportantissuesin real-timecomputergraphicsand

visualization.Polygonalmodelsimplificationalgorithmstakea full-sizedpolygonalmodelasinput and

outputa less-detailedversionof themodelwith fewer polygons.For real-timedisplay, fewer polygons

resultin fasterrendering.For otherapplications,thespeedof renderingmaybetraded-off for improved

imagequality. Even thoughthe simplified modelsmay be pre-computed,somefull-sized modelsare

largeenoughto requirehoursto simplify, dueto paging.Whencomputingmultipleversionsof thesame

model,with varyinglevelsof detail,theperformanceof thesimplificationprocessbecomescritical.

R-Simp is arecentsimplificationalgorithmthatofferslow runtimes,easycontrolof theoutputmodel

size,andgoodmodelquality. However, whentheinternaldatastructuresfor theinput modelarelarger

thanmain memory, many simplificationsalgorithms,including R-Simp, suffer from poor performance

duetopaging.Wepresentacasestudyof theR-Simp algorithmandhow its datalocalityandperformance

canbesubstantiallyimprovedthroughanoff-line spatialsortandanon-linereorganizationof its internal

datastructures.

We empiricallycharacterizethedataaccesspatternof R-Simp on threelargemodelsandpresentan

application-specificmetric,calledcluster pagespan, of R-Simp’s locality of memoryreference.We also

experimentallyshow that both spatialsortingandreorganizationcanindependentlyimprove R-Simp’s

performanceby a factorof 2 to 6-fold. Whenbothtechniquesareused,R-Simp’sperformanceimproves

by up to 7-fold.

Keywords: graphics,visualization,performance,polygonalmodelsimplification,memorylocality, pag-

ing, datastructurereorganization,spatialsorting

Length: 15 pagesof text and6 pagesof figures/tables

1

Page 2: A Case Study of Improving Memory Locality In Polygonal

Model Faces(Polygons) Vertices Initial VirtualMemorySize(MB)

File Sizefor PLYModel (MB)

hand 654,666 327,323 123.6 31.1dragon 871,306 435,545 164.1 32.2blade 1,765,388 882,954 331.3 80.1

Table1: Summaryof Full-SizedPolygonalModels

1 Intr oduction

In general,themorepolygonsin a three-dimensional(3D) computergraphicsmodel,themoredetailedand

realistic is the renderedimage. For example,modelsof complicatedmachineryandscannedobjectscan

containmillions or billions of polygonsin their mostdetailedform. However, thesamelevel of detailand

realismmay not be requiredin all computergraphicsapplications.For real-timedisplay, fewer polygons

resultin fasterrendering;performancemaybethemostimportantcriteria.For otherapplications,thespeed

of renderingmaybe tradedoff for improved imagequality. Theflexibility to selecta versionof thesame

modelwith adifferentlevel of detail(i.e.,adifferentnumberof polygons)canbeimportantwhendesigning

agraphicssystem.

Polygonalmodelsimplificationalgorithmstake a full-sizedpolygonalmodelasinput andoutputa ver-

sionof themodelwith fewer polygons.Althoughthesimplifiedmodelsareof high quality, it is alsoclear

that,uponcloseexamination,somedetailshavebeensacrificedto reducethesizeof themodel(Figure1). A

numberof modelsimplificationalgorithmshave beenproposed[3], eachwith their differentstrengthsand

weaknessesin termsof executiontime andtheresultingimagequality. In general,thealgorithmsareused

to pre-computesimplifiedversionsof themodelsthatcanbeselectedfor useat run-time.

However, modelsthathave over a million polygonsrequiretensof megabytesof disk storage,andre-

quire hundredsof megabytesof storagefor their in-memorydatastructures(Table1). For example,the

blade modelhasalmost1.8 million polygonsandrequires331.3megabytes(MB) of virtual memoryafter

beingreadin from an80.1MB diskfile.1 Pointersandlistsassociatingverticesto facesandotherdatastruc-

turesaccountfor thesubstantialgrowth in a model’s storagerequirementsasit is readinto mainmemory.

Furthermore,asthecomputationprogresses,thevirtual memoryfootprint of theprocessincreasesasother

datastructuresaredynamicallyallocated.

1Initial virtual memorysizeis readfrom the/proc filesystem’sstat device underLinux 2.2.12beforeany simplificationcomputationis performed.

2

Page 3: A Case Study of Improving Memory Locality In Polygonal

(a) Hand (Original,654,666polygons) (b) Hand (Simplified,R-Simp, 40,000polygons)

(c) Dragon (Original,871,306polygons) (d) Dragon (Simplified,R-Simp, 40,000polygons)

(d) Blade (Original,1,765,388polygons) (f) Blade (Simplified,R-Simp, 40,000polygons)

Figure1: Full-SizedOriginal andSimplifiedVersionsof PolygonalModels

3

Page 4: A Case Study of Improving Memory Locality In Polygonal

As thesizeof theinput modelincreases,thedatastructuresrepresentingthemodelbecometoo largeto

residein mainmemory. Of course,thevirtual memorysubsystemin contemporaryoperatingsystemsallows

for datastructuresthat do not fit in main memory. Pagesof datacanbe swappedto andfrom secondary

storage,suchasa harddisk, in responseto thedataaccesspatternsof theuser’s process.However, if there

is poorlocality andre-usein thedataaccesspatterns,performanceis greatlyreducedaspagingbecomesthe

performancebottleneck.

Consequently, weexaminetheproblemof dataaccesslocality andits effectonperformancefor amodel

simplificationalgorithm. In particular, we make a casestudy of the recently-introduced R-Simp simpli-

fication algorithm. R-Simp offers low executiontimes, easycontrol of the outputmodel size, andgood

modelquality [1]. We focuson systems-orientedrun-timeperformanceandrelatedmetricsof R-Simp, as

opposedto measuresof modelquality. A detailedstudyof othermodelsimplificationalgorithmsis partof

ouron-goingresearchandis beyondthescopeof this currentpaper.

1.1 Moti vation and RelatedWork: Lar geModels

As new modelacquisitiontechnologieshave developed,thecomplexity andsizeof 3D polygonalmodels

have increased.First, modelproducershave begun to use3D scanners(for example,[5]). Second,asthe

speedandresolutionof scientific instrumentationincreases,so hasthe sizeof the datasetsthat mustbe

visualized.Both of thesetechnologychangeshave resultedin modelswith hundredsof millions to billions

of polygons.Currenthardwarecannotcomecloseto displayingthesemodelsin real time. Consequently,

thereis a largebodyof “modelsimplification” researchaddressingthisproblem[3].

Most of thesemodel simplification algorithms(e.g., [8, 4, 2]) usea greedysearchapproachwith a

time complexity of O(� log � ), where � is thenumberof polygonsin theoriginal model.Also, thegreedy

algorithmshavepoorlocality of memoryaccess,jumpingaroundthesurfaceof theinputmodel,from puzzle

pieceto puzzlepiece. Oneexceptionto this trendis the simplificationalgorithmdescribedby Lindstrom

[6], basedon the algorithmby RossignacandBorrel [7]. Lindstrom’s algorithm is fast, but it produces

simplifiedmodelsof poorqualityand,by its nature,it is difficult to controltheexactnumberof polygonsin

thesimplifiedmodel.

In contrast,therecently-introducedR-Simp algorithm[1] producesapproximatedmodelsof substantially

higher quality than thoseproducedby Rossignacand Borrel’s approach,and R-Simp allows very exact

control of outputsize,all without a severecostin executionspeed.R-Simp is uniquein that it iteratively

refinesaninitial andvery poorapproximation,ratherthansimplifying thefull input model. Consequently,

4

Page 5: A Case Study of Improving Memory Locality In Polygonal

R-Simp hasa timecomplexity of O(� log ), ratherthanO(� log � ), where is thenumberof polygonsin

thesimplifiedmodel.Since��� in practice,R-Simp’s advantagecanbesubstantial.Furthermore,aswe

will seein Section3, R-Simp’s approachof refininga largepieceof themodelinto smallerpieces,provides

for reasonablelocality of memoryaccesses,if the“piece” canfit into mainmemory.

1.2 Overview

We begin by providing empiricalevidencefor the poor memorylocality of R-Simp whenthe modeldoes

not fit in mainmemory. Othersimplificationalgorithmswould alsohave poormemorylocality, but we fo-

cuson R-Simp. We introducecluster pagespan asanapplication-specificmetric for locality andshow how

even small improvementsin clusterpagespancangreatlyimprove the residentworking setof R-Simp. In

particular, we describehow two simpletechniquescanreduceclusterpagespanandimprove performance

by upto 7-fold. First, theoff-line techniqueof spatial sorting addsanalgorithm-independent preprocessing

stepto modelsimplification.Second,theon-linetechniqueof cluster data structure reorganization (or reor-

ganization)is application-specific.We useoptimizedandinstrumentedversionsof R-Simp to demonstrate

theperformanceof our techniquesandto provide graphsthatprovide intuition asto thenatureof R-Simp’s

memorylocality andotherperformancemetrics.

2 The Problem: Locality of Memory Accesses

To understandthedataaccesspatternsof R-Simp, we instrumentedaversionof thecodesuchthatanon-line

traceis producedof all themodel-relatedmemoryaccesses.Eachmemoryaccessis timestampedusingthe

valuereturnedby theUnix systemcall gettimeofday(), wherethefirst accessis fixedat time 0. The

resultingtracefile is processedoff-line. A scatterplot of thevirtual addressof eachmemoryaccessversus

thetimestampshowstherelatively poorlocality of referencefor R-Simp (Figure2). Notethedifferentscales

on theX andY-axesof eachgraph,eventhoughthey areof thesamesize.Thehardwareplatformis a 500

MHz PentiumIII system,runningLinux 2.2.12,with 128 MB of RAM, anda IDE swap disk. Although

therearedualprocessors,R-Simp is asingle-threadedapplication.

The instrumentationandtracingaddssignificantoverheadto R-Simp’s executiontime. Consequently,

we have scaledtheX-axis. For example,thenon-instrumentedR-Simp requires2,597secondsto compute

thesimplifiedmodelfor hand. TheinstrumentedR-Simp requiresmuchmoretime to execute,but we have

post-processedthe traceinformationso that theX-axis appearsto be2,597seconds.Thesamescalingof

theX-axisappearsfor theothermodels.Althoughthescalingmayintroducesomedistortionsto thefigures,

5

Page 6: A Case Study of Improving Memory Locality In Polygonal

it still capturesthebasicnatureof thememoryaccesspatterns.

As R-Simp computesthesimplifiedmodel,it accessespartsof theoriginal modelwith poor locality of

reference.Thegraphsshow a “white noise”pattern(Figure2). If themodelfits in main memory, paging

would not be a problem. However, the graphsintuitively explainswhy pagingis a performanceproblem

for modelsthatdo not fit in mainmemory. As will bediscussedfurther in Section4.2, thepoor locality of

referenceis not dueto apooralgorithm,but is dueto poorspatiallocality in theoriginalmodel.

Every graphin Figure 2 hassomecommonpatterns. The darker horizontalband,betweenaddress ������� ����

and � ������ ����on the LHS Y-axis, is the region of virtual memorywherethe datastructure

for the vertex list is stored. The horizontalregion above the verticesis wherethe facelist is stored. The

datastructuresanddetailsof R-Simp arediscussedin Section3. In eachgraph,thereis a vertical line (for

example,attime � ������ ��� millisecondsfor dragon) whichis whenR-Simp finishesthesimplificationprocess

andbeginsto createthenew simplifiedmodel.Thebandof memoryaccessesat thetopof thegraph,andto

theimmediateright of theverticalline, is wherethenew modelis stored.

Superimposedon eachgraphin Figure2 is a solid curve representingthecumulative majorpagefaults

of R-Simp over time,asreportedby theUnix functiongetrusage(). A majorpagefault requiresa disk

accessto swapout thevictim pageandanotherdisk accessto swap in theneededpageof data. Note that

thereis somediscrepancy betweenthemajorpagefaultsincurredby theinstrumentedR-Simp (Y-axisonthe

RHS)andthemajorpagefaultsincurredby thenon-instrumentedR-Simp (notedin thecaptionsof Figure2

andTable3). Wehave purposelynotscaledtheRHSY-axissincethatis moreproblematicthanscalingreal

time. Again, thefiguresareonly meantto give an intuitive pictureof thedataaccesspatternsinherentto

R-Simp.

SinceR-Simp incurssomany pagefaultson thesemodels,andsincethereappearsto bepoorlocality of

referencefor themodel’s in-memorydatastructures,we canfocusour attentionon improving locality and

reducingpagefaults.However, we mustfirst provide anoverview of theR-Simp algorithm.

6

Page 7: A Case Study of Improving Memory Locality In Polygonal

X-axis is scaledrealtime in milliseconds,LHS Y-axisis virtual address(scatterplot),

RHSY-axisis cumulativemajorpagefaults(solid curve) in theinstrumentedversionNOTE: X andY-axeshavedifferentrangesfor eachgraph

(a)Hand: 2,597secondsrun-time,1,812,649majorpagefaultsin non-instrumentedversion

(b) Dragon: 7,736secondsrun-time,6,410,393majorpagefaultsin non-instrumentedversion

(c) Blade14,312secondsrun-time,13,063,493majorpagefaultsin non-instrumentedversion

Figure2: MemoryAccessPatternfor Original R-Simp: Computing40,000polygonsimplifiedmodel7

Page 8: A Case Study of Improving Memory Locality In Polygonal

3 The R-Simp Algorithm

The R-Simp algorithmbegins by readingthe original model into main memory. The model is contained

within asinglecluster andrepresentstherootnodeof an � -ary tree.A clusteris acollectionof verticesand

facesfrom theoriginalmodel.Theinitial clusteris thensubdividedinto eightsub-clusters.Ultimately, each

clusterwill betransformedinto a new vertex in thesimplifiedmodel.Therefore,thesesub-clustersarethen

iteratively subdivideduntil therequirednumberof clustersis reached.Thedecisionto subdivide aclusteris

basedon theamountof variationin theorientationof thefacesin thecluster(i.e., curvature).Thefinal set

of clustersrepresentsverticesthatlie on thesimplifiedsurface.Next, theseverticesaretriangulatedto form

thefacesof thenew surface.Finally, thenew verticesandfacesarewritten to anoutputfile.

Thiswholeprocesscanbepartitionedintosixphases:input,initialization,simplification,post-simplification,

triangulation,andoutput. Below, we discussthe maindatastructuresin R-Simp andprovide moredetails

abouteachphaseof thecomputation.

3.1 Data Structures

R-Simp usesthreemain datastructuresto performmodelsimplification. The first two datastructuresare

relatively staticandareusedto storetheoriginalmodel.Thelastdatastructure,thecluster, is moredynamic

andrepresentsanodein the � -ary tree.

First, thevertex list is a globaldatastructurethatstorestheverticesfrom the input model. Thevertex

list is anarrayof Vertex structures.A Vertex structurecontainsthreemaindatafields.Thefirst field is

the !#"%$'&($*),+ coordinatesof thevertex in 3D space,representedasfloatingpoint values.Thesecondfield is

anadjacency list of vertices.Thethird field is anadjacency list of faces.Theadjacency listsareconstructed

during the initialization phaseandareindicesinto the vertex list andfacelist. All referencesto vertices

andfacesin R-Simp aredonethis way. Therefore,theglobalvertex andfacelists areaccessedthroughout

R-Simp’s execution. Note that the adjacency lists do not exist in the versionof the modelon secondary

storage,whichpartly accountsfor why themodelrequiresmorestoragewhenreadinto mainmemory.

Second,the face list is alsoa globaldatastructurethatstoresthefacesfrom the input modelandis an

arrayof structuresof typeFace andcontainsthreefields. Thefirst field containstheverticesthatmake up

the face. The secondandthird fields arethe face’s normalandits area,respectively. This informationis

usedduringthesimplificationphaseandcomputedduringtheinitializationphase.

Lastly, thecluster structurerepresentsaportionof theoriginal modelandis a nodein the � -ary tree.A

clustercontainsa list of verticesfrom theoriginal model,a representative normalto thepatch,theamount

8

Page 9: A Case Study of Improving Memory Locality In Polygonal

of surfacevariation,andthe areaof the patch. The patchareaandthe representative normalareusedto

computethesurfacevariation.

Iteratively, a leaf clusteris selectedfrom a priority queueandsubdividedinto 2, 4, or 8 sub-clusters.At

theendof thesimplificationphase,eachclusterrepresentsasinglevertex in thesimplifiedmodel.Therefore,

many verticesfrom theoriginalmodelaresimplifiedinto asinglevertex. Thenew vertex is computedin the

post-simplificationphaseandusedin thetriangulationphaseto computethesimplifiedsurface.

3.2 Phasesof R-Simp

Thefirst phaseis the input phase.Theoriginal modelis readin from thefile andthe initial vertex list and

facelist arecreated.Thesedatastructuresrequirea lot of virtual memory(Table1). Theblocksof memory

thatareallocatedin thisphaseareusedthroughouttheentiresimplificationprocess(Section2, Figure2) and

areonly freedat theend. Theseblocksof memorydo not increaseor decreasein sizeduring theprocess,

but they areaccessedthroughoutthecomputation.

In thesecondphase,the in-memorydatastructuresareinitialized. The initialization phasecreatesthe

vertex andfaceadjacency lists in thevertex list andcomputesthenormalandareafor eachfacein theface

list. The adjacency lists areusedin the simplificationphaseto determineandenforcesurfacetopology.

Theinitialization phasesequentiallyiteratesthroughthefacelist. For eachface -/. thevertices( 021 , 0�3 , and

0�4 ) areextracted.Then,for eachvertex 065 the face -/. is addedto the 0�5 ’s faceadjacency list. The vertex

adjacency listsareupdatedfor eachvertex 065 by addingtheothervertices(i.e., 0�7 where8:9;=< ) to thelist if

they arenotalreadypresent.Finally thefacenormalandthefaceareaarecomputed.

Thethird phaseis theheartof R-Simp: thesimplificationphase.In thisphasetheoriginal setof vertices

is reducedto the desiredamount. The simplificationphasestartswith all the verticesin a singlecluster.

Thisclusteris thensubdividedinto eightsub-clustersby computingaboundingboxaroundtheverticesand

subdividing theboundingbox usingthreeaxis-alignedplanespositionedat its centre.Theseeightclusters

areinsertedinto apriority queueandthemainloopof R-Simp begins.Thepseudo-codeis shown in Figure3.

The priority queue( >@? ) holdsreferencesto the clustersandordersthe clustersbasedon the surface

variation. A topologycheck,basedon breadth-firstsearch,is usedto determinewhetherthesurfacein the

clusteris connected.If thesurfaceis disconnected,theclusteris partitionedalongcomponentboundaries

andeachcomponentis put into its own sub-cluster. If thesurfaceis connectedthena statisticalapproach

is usedto approximatethe optimal partitioning for the cluster. The clusteris partitionedinto 2, 4, or 8

sub-clusters.Oncethesetof sub-clusters(� &BA ����� &�CD� ) hasbeencreated,thesurfacevariationis computed

9

Page 10: A Case Study of Improving Memory Locality In Polygonal

E , " , &�. are clusters� &BA ����� &�CD� is the set of sub-clusters created from ">F? is a priority queuedo"HG front of priority queue >F?Compute face list for "Run topology check on "if " is disconnected thenSplit " along component boundaries I � &BA ����� &�CD�

elseUsing " compute quadric JSplit " using JKI � &BA ����� &�CD�

foreach E G � &BA ����� &�CD�E � LNM�OQP/M�OSR &UT ; Compute surface variation for Einsert( s, >F? )

until sizeof( >@? ) ; desired number of vertices

Figure3: SimplificationPhase

for eachsub-cluster( E ) asthey areinsertedinto thepriority queue.

Thegreaterthesurfacevariation,thehigherthepriority of thecluster. Therefore,thesimplifiedmodel

will have more verticesin regions of high surfacevariation. When a cluster is partitioned,the surface

representedby the clusteris subdivided into sub-patchesthat arepotentiallyflatter. This processiterates

until thepriority queuecontainstherequirednumberof clusters,whichdeterminesthenumberof polygons

in thesimplifiedmodel. Eachsplit causesmoreclustersto becreatedthustheamountof memoryusedby

thisphasedependsonthelevel of detailrequired.In general,thecoarserthelevel of detailrequired,theless

memoryis needed.

The fourth phaseis thepost-simplificationphase:thefinal setof verticesfor thesimplifiedmodelare

computed.Initially, thealgorithmexaminesall theclustersin thepriority queueto ensurethateachcluster

only representsa single connectedcomponent;this is accomplishedby running the topology checkon

the cluster. Eachclusterrepresentsa singlevertex, 0BVQW , in the simplified model; for eachclusterR-Simp

computesthe optimal positionof 0BVXW . Finally, the algorithmchangesall of the pointersfrom the original

verticesin aclusterto 0BVQW .Thefifth phaseis the triangulationphase,wherethealgorithmcreatesthe faces(i.e., polygons)of the

new simplifiedmodel.Thealgorithmiteratesthroughall thefacesin theoriginalmodelandexamineswhere

10

Page 11: A Case Study of Improving Memory Locality In Polygonal

Model Output Size Original R-Simp w/Spatial Sort w/Reorganization w/Both(polygons) H:M:S seconds H:M:S seconds H:M:S seconds H:M:S seconds

hand 10,000 00:23:21 1,401 00:14:14 854 00:05:05 305 00:04:30 270hand 20,000 00:31:45 1,905 00:19:00 1,140 00:05:42 342 00:04:53 293hand 40,000 00:43:17 2,597 00:26:22 1,582 00:06:50 410 00:05:53 353

dragon 10,000 01:12:55 4,375 00:25:24 1,524 00:23:29 1,409 00:14:52 892dragon 20,000 01:37:09 5,829 00:34:18 2,058 00:29:38 1,778 00:19:43 1,183dragon 40,000 02:08:56 7,736 00:46:21 2,781 00:41:24 2,484 00:26:03 1,563

blade 10,000 02:18:26 8,306 01:07:32 4,052 01:15:39 4,539 00:59:05 3,545blade 20,000 03:01:17 10,877 01:28:09 5,289 01:35:00 5,700 01:19:52 4,792blade 40,000 03:58:32 14,312 01:58:04 7,084 02:08:40 7,720 01:51:53 6,713

Table2: Performanceof PolygonalModelSimplification:VariousStrategies

the verticesof thesefaceslie. If two verticespoint to thesamenew vertex (i.e., two original verticesare

within thesamecluster)thenthefacehasdegeneratedinto a line andthefaceis thrown away. Likewise,if

all threeverticespoint to thesamenew vertex, it is becausethenthefacehasdegeneratedinto a point, and

thatfaceis thrown awayaswell. Only if all threeoriginalverticespoint to differentnew verticesdowekeep

thefaceandaddit to thenew facelist. Theverticesof this new facepoint to theverticesin thenew vertex

list.

In the lastphase,theoutputphase,thealgorithmwrites thenew vertex list andthenew facelist to the

outputfile.

4 Impr oving Memory Locality and Performancein R-Simp

In additionto experimentswith aninstrumentedversionof R-Simp, we have alsobenchmarked versionsof

R-Simp without the instrumentationandwith compileroptimizations(-O) turnedon. Again, our hardware

platformis a 500MHz PentiumIII system,runningLinux 2.2.12,with 128MB of RAM, anda IDE swap

disk. Althoughtherearedualprocessors,R-Simp is a single-threadedapplication.R-Simp is written using

C++ andwe usedtheegcs compiler, version2.91.66on ourLinux platform.

Whenthedatastructuresof amodelfit within mainmemory, R-Simp is known to have low runtimes[1].

However, thehand, dragon, andblade modelsarelargeenoughthatthey requiremorememorythanis phys-

ically available(Table1).2 And, asthecomputationprogresses,datastructuresaredynamicallyallocated

andthesizeof thevirtual memoryusedby theprocessincreases.Consequently, thebaselineR-Simp (called

2Thehand modelby itself wouldinitially fit within 128MB, but theoperatingsystemalsoneedsmemory. Consequently, pagingoccurs.

11

Page 12: A Case Study of Improving Memory Locality In Polygonal

Model Output size Original R-Simp w/Spatial Sort w/Reorganization w/Both(polygons) (majorpagefaults) (majorpagefaults) (majorpagefaults) (majorpagefaults)

hand 10,000 1,042,444 701,789 265,328 249,849hand 20,000 1,361,492 880,498 281,338 254,083hand 40,000 1,812,649 1,167,593 297,028 268,952

dragon 10,000 3,670,516 1,409,489 1,070,466 702,050dragon 20,000 4,892,932 1,808,060 1,329,220 875,622dragon 40,000 6,410,393 2,386,238 1,767,490 1,084,346

blade 10,000 7,678,929 3,878,566 4,127,079 3,125,348blade 20,000 9,988,417 5,002,567 5,195,597 4,149,098blade 40,000 13,063,493 6,599,763 6,997,380 5,717,086

Table3: PageFaultCountof PolygonalModelSimplification:VariousStrategies

“Original R-Simp”) experienceslong run times(Table2) dueto thehighnumberof pagefaultsthatit incurs

(Table3). As expected,the larger the input model,the longerthe run time andthe higherthe numberof

pagefaults.As theoutputmodel’s sizeincreases,therun timeandpagefaultsalsoincrease.

As previously discussed(Section3), the global vertex andfacelists arelarge datastructuresthat are

accessedthroughoutthe computation.Therefore,they arenaturaltargetsof our attemptsto improve the

locality of memoryaccess.In particular, we have developedtwo different techniquesthat independently

improve the memorylocality of R-Simp andimprove its real time performanceby a factorof 2 to 6-fold,

dependingon the modeland the sizeof the simplified model. Whencombined,the two techniquescan

improve performanceby up to 7-fold.

4.1 Metrics: Cluster Pagespanand ResidentWorking Set

We introducecluster pagespan asan application-specificmetric of the expectedlocality of referenceto a

model’s in-memorydatastructure. Clusterpagespanis definedas the numberof unique virtual memory

pagesthat have to be accessedin order to touch all of the verticesand faces,in the global vertex and

facelists, in thecluster. For eachiterationof thesimplificationphase(Figure3), R-Simp’s computationis

focussedonasingleclusterfrom thefront of thepriority queue.Thetopologycheckandsplit computations

arelocalizedto onecluster, but they have to accessall of thecluster’s verticesandfaces.Therefore,if the

pagespanof theclusteris large,thereis a greaterchancethata pagefault will beincurred.Thesmallerthe

clusterpagespan,thelower thechancethatoneof its pageshasbeenpagedoutby theoperatingsystem.

12

Page 13: A Case Study of Improving Memory Locality In Polygonal

Cluster Pagespan:X-axis is scaledwallclock time,Y-axisis pagespanofclusterat front of priority queue

ResidentWorking Setof Original Model:X-axis is scaledwallclock time,Y-axisis numberofclustersin themodelwith Y 95%of pagesin mainmemory

(a)OriginalR-Simp, 40,000polygonsin output

(b) R-Simp with SpatialSortingOnly, 40,000polygonsin output

Figure4: ClusterPagespanandResidentWorkingSetfor blade: Part1

13

Page 14: A Case Study of Improving Memory Locality In Polygonal

Cluster Pagespan:X-axis is scaledwallclock time,Y-axisis pagespanofclusterat front of priority queue

ResidentWorking Setof Original Model:X-axis is scaledwallclock time,Y-axisis numberofclustersin themodelwith Y 95%of pagesin mainmemory

(c) R-Simp with ReorganizationOnly, 40,000polygonsin output

(d) R-Simp with bothSpatialSortingandReorganization,40,000polygonsin output

Figure5: ClusterPagespanandResidentWorkingSetfor blade: Part2

14

Page 15: A Case Study of Improving Memory Locality In Polygonal

Figure4(a)showstheclusterpagespanof theclusteratthefront of thepriority queueduringanexecution

of R-Simp with theblade model.Sincetheinitial clustersarevery large,theclusterpagespanis alsolargeat

thebeginningof theexecution.As clustersaresplit, theclusterpagespandecreasesover time. Thecluster

pagespandatapointsare joined by lines in order to more clearly visualizethe pattern. An instrumented

versionof R-Simp wasusedto gatherthedata.Furthermore,for eachiterationof thesimplificationloop,all

of theclustersin thepriority queueareexamined.If Z\[��B] of thepagescontainingtheverticesandfaces

of a clusterarein physicalmemory(asopposedto beingswappedout onto theswap disk), that clusteris

consideredto beresidentin mainmemory. Therefore,Figure4(a)alsoshows thecount of how many of the

clustersin thepriority queueareconsideredto bein memoryandpartof theresidentworkingset,over time.

WemodifiedtheLinux kernelsothatit is possibleto checkif aspecificvirtual pageof datais in memoryor

on theswapdisk. Thatpageof datais not pagedbackinto memoryaspartof thecheck.

The moreclustersin the residentworking set, the lesslikely it is that a computationperformedon a

clusterwill resultin pagefaults. As theclustersdecreasein size,moreclusterscansimultaneouslyfit into

main memory. Note that the verticesand facesare from the original (not simplified) model. The graph

of the residentworking setis not monotonicallyincreasingbecausetheoperatingsystemperiodically(not

continuously)reclaimspages,sotheresidentworkingsetcountcandropsignificantlywithin a shortperiod

of time. Also, the vertical lines representimportantphasetransitions. The residentworking set count

begins to decreaseafter the simplificationphasebecausethe post-simplificationand triangulationphases

dynamicallycreatenew vertex andfacelists,whichdisplacefrom memorythelists from theoriginalmodel.

Theclusterpagespangraphin Figure4(a)indicatesthatthereis poormemorylocality in how themodel

is accessedfor the initial portion of R-Simp’s execution. Consequently, the total numberof clustersthat

residein mainmemoryremainsunder2,000for mostof thesimplificationphaseof thealgorithm.

Wenow describetwo techniquesthatmeasurablyimprove memorylocality, accordingto clusterpages-

panandtheresidentworkingsetcount,andalsoimprove realtimeperformance.

4.2 Off-Line Spatial Sorting

Themodelsusedin this casestudyarestoredon disk in thePLY file format. Thefile consistsof a header,

a global list of the vertices,anda list of the faces. Eachpolygonalfaceis a triangleandis definedby a

per-facelist of threeintegers,whicharetheindex locationsof theverticesin theglobalvertex list. Thereis

no requirementthattheorderin whichverticesappearin theglobal list correspondsto their spatiallocality.

Two verticesthatarespatialneighboursin the3D geometryspacecanbein contiguousindicesin theglobal

15

Page 16: A Case Study of Improving Memory Locality In Polygonal

vertex list, or they canbeseparatedby anarbitrarynumberof othervertices.Thereis alsonospatiallocality

constrainton theorderof facesin thefile. In R-Simp, theverticesandfacesarestoredin mainmemoryin

thesameorderin which they appearin thefile, thereforetheorderof theverticesandfacesin thefile have

adirectimpacton thelayoutof thedatastructuresin memory.

Thelargeclusterpagespanvaluesseenin theearlyportionof R-Simp’s execution(Figure4(a))suggests

thatperhapsthePLY modelshavenotbeenoptimizedfor spatiallocality. Therefore,wedecidedto spatially

sort thePLY file. Themodelitself is unchanged;it hasthesamenumberof verticesandfacesat thesame

locationsin geometryspace,but we changetheorderin which theverticesandfacesappearin thefile. Our

spatialsort readsin the modelfrom the file, sortsthe verticesandfaces,andthenwrites the samemodel

backto disk in thePLY format. Therefore,thespatialsort is a preprocessingstepthatoccursbeforemodel

simplification. The spatially sortedversionof the PLY file can thenbe re-usedfor different runsof the

simplificationprogram.

Thespatialsortis arecursivealgorithm.Excludinginputandoutput,therearefivestepsto thealgorithm.

After readingthe model into memory, the first stepis to identify a 3D boundingbox for the model. The

secondstepis to selectthreeorthogonaldividing planesto partitiontheboundingbox into eightsub-boxes.

Eachdimensionof a sub-boxis half the dimensionof the original boundingbox. Then,eachsub-boxis

recursively partitionedinto eight sub-boxes. The stoppingconditionfor the recursionis whena sub-box

containslessthantwo vertices.Thethird stepis to reordertheverticesin thevertex list suchthatverticesin

thesamesub-box,whicharespatialneighbours,have indicesthatarecontiguous.

Sincetheverticeshave beenreorderedin theglobalvertex list, thefourth stepis to updatetheper-face

vertex lists to reflectthenew index order. For eachface,thethreeverticesthatdefinethefacearelistedin

monotonicallyincreasingorderof thevertex’s index in theglobal vertex list. Thefifth stepis to spatially

sort the faces.The first vertex’s global index in eachper-facelist is usedastheprimary sort key, andthe

otherindicesarethesecondaryandtertiarysortkeys, respectively. Consequently, facesthatareneighbours

in geometryspacearealsoneighboursin thefacelist. Finally, themodelis written backto disk in thePLY

formatwith thevertex andfacelists in thenew spatiallysortedorder.

Our implementationof the spatialsort is written in C++ andusesthe standardC++ library’s efficient

std::partition() andstd::sort() functionsto reordertheverticesandsort the faces.Spatially

sortingthehand, dragon, andblade modelsrequire28, 28, and89 seconds,respectively, on our 500MHz

PentiumIII-basedLinux platform. Again, themodelsonly have to besortedoncesincethenew PLY files

areretainedon disk.

16

Page 17: A Case Study of Improving Memory Locality In Polygonal

WhenR-Simp is given a spatiallysortedmodel for the input, clusterpagespanis reducedthroughout

theprocess’s executionwith a resultingimprovementin theresidentworking setof theoriginal model.For

blade, therearesignificantimprovementsin bothclusterpagespanandresidentworking set(Figure4(b)),

which resultsin morethana 50%reductionin R-Simp’s executiontime (Figure4(b) andTable2). Spatial

sortingalsobenefitsthehand anddragon models(Table2).

Sincespatialsortingdramaticallyimprovesthememorylocality of R-Simp without any changesto the

algorithmitself, it suggeststhatR-Simp doesexhibit goodmemorylocality if theinputmodelalsohasgood

spatiallocality. If theinput modelhaspoorlocality, which is thecasefor thecommonly-usedPLY files for

thesemodels,thenR-Simp will suffer a performancehit. Although thespatialsort is currentlyan off-line

preprocessingphasefrom R-Simp, we mayintegrateit into our implementationof R-Simp in thefuture. In

themeantime,our experimentsandmeasurementssuggestthatotherresearchersshouldconsiderspatially

sorting the input modelsfor their graphicssystems.Spatialsorting is a simple and fast procedurewith

substantialperformancebenefitsfor R-Simp (Table2). Whetherothersimplificationalgorithmswill also

benefitfrom spatialsortingis unclearandis left for futurework.

4.3 On-Line Reorganizationof Data Structures

Theperformanceimprovementsdueto astaticspatialsortaresubstantial.However, asamodelis iteratively

simplified, theremay be an opportunityto dynamicallyimprove memorylocality above andbeyond the

benefitsof the spatialsort. In particular, asa clusterin themodelis split andrefinedby R-Simp, the size

of the resultingclustersis reduced,but the verticesandfacesreferencedby the clusterarethe samedata

structuresaswhenthe modelwasfirst readinto main memory. A clusterthat containsverticesandfaces

from disparatepartsof theoriginal modelwill have a high clusterpagespanevenif thenumberof vertices

andfacesarelow.

We have implementeda versionof R-Simp that dynamicallyreorganizesits internaldatastructuresin

order to reduceclusterpagespan.Specifically, before a sub-clusteror clusteris insertedinto the priority

queue(seeFigure3), theclustermaybeselectedfor cluster data structure reorganization (or simply, reor-

ganization). If selectedfor reorganization,theverticesandfacesassociatedwith theclusterarecopiedfrom

theglobalvertex andfacelists into new lists on new pagesof virtual memory. Thebasicideais similar to

compactingmemoryto reducefragmentationin memorymanagement.Internalto theclusterdatastructure,

the lists of verticesandfacesnow refer to thenew vertex andfacelists, therebyguaranteeingtheminimal

possibleclusterpagespan.If theclustermakesit to thefront of thepriority queueagain,thetopologycheck

17

Page 18: A Case Study of Improving Memory Locality In Polygonal

andothercomputationsperformedon theclusterwill have fewerpagefaultsdueto theimprovedlocality of

how thedatais stored.

Sincetherearecopying andmemoryallocationoverheadsassociatedwith reorganization,it is notdone

indiscriminately. Two criteriamustbemetbeforereorganizationis performed:

1. Theclusterpagespanafter reorganizationmustbelessthan � pages.

2. Theclusteris aboutto beinsertedinto thepriority queuewithin thefront 50%of clustersin thequeue

(i.e., theclusteris in thefront half of thequeue).

The first criteria controlsat what point reorganizationis performed,in termsof clustersize. Reorga-

nizing whenclustersarelarge is expensive andinherentlypreserveslargeclusterpagespans.Reorganizing

whenclustersaresmallmaydelayreorganizationuntil mostof thesimplificationphaseis completed,thus

reducingthechancesof benefitingfrom the reorganization.A simplecountof thenumberof verticesand

facesin thecluster, multipliedby theamountof storagerequiredfor each,candeterminethepotentialcluster

pagespanafter reorganization.Thespecificvalueof � chosenasthe thresholdfor thefirst criteriahas,so

far, beendeterminedexperimentally. For example,asweincrease� , thebenefitsof reorganizationfor blade

increasesuntil it plateausandstartsto decrease(Figure9). The optimal valuefor � is found to be 1,024

pages(i.e.,4 MB giventhe4 K pagesonourplatform)for blade. For thethreemodels,theoptimalvalueof

� wasempiricallybetween1,024and4,096pages.

Thesecondcriteriatriesto maximizethechancesthata reorganizedclusterwill beaccessedagain(i.e.,

it will reachthe front of thepriority queueagainandbere-used).Reducingthepagespanof a clusterthat

is not accessedagainuntil thepost-simplificationphaseproducesfewer benefits.Thevalueof “50%” was

alsoempiricallydetermined.Sincethepriority valueof theclusteris computedbeforeit is insertedinto the

queue,this is acheapcriteriatestto perform.

Thebenefitsof reorganizationarereflectedin thereducedclusterpagespanandincreasedresidentwork-

ing set(Figure5(c)), reducednumberof pagefaults(Table3), andmostimportantly, in thelower run times

(Table2). By itself, reorganizationcansubstantiallyimprove performance.In fact,for hand (Figure6), re-

organizationaloneproducesa substantiallygreaterimprovementthanspatialsortingalone.And, for blade

(Figure8), spatialsortingaloneproducesa slightly greaterimprovementthanreorganizationalone. Both

techniquesareuseful. But, whenbothspatialsortingandreorganizationareappliedto R-Simp, thereis an

additionalbenefit(Table2, Figure6, Figure7, Figure8). Of course,thetwo techniqueshave someoverlap,

sothebenefitsarenot completelycumulative.

18

Page 19: A Case Study of Improving Memory Locality In Polygonal

0

20

40

60

80

100

120100

61.0

10,000

21.8 19.3

100

59.8

20,000

18.0 15.4

100

60.9

40,000

15.8 13.6

Outputsize(polygons)

NormalizedExecutionTime (lower is better)

R-Simp SpatialSort Reorganization Both

Figure6: Hand:NormalizedExecutionTime

0

20

40

60

80

100

120100

34.8

10,000

32.220.4

100

35.3

20,000

30.520.3

100

35.9

40,000

32.120.2

Outputsize(polygons)

NormalizedExecutionTime (lower is better)

R-Simp SpatialSort Reorganization Both

Figure7: Dragon:NormalizedExecutionTime

0

20

40

60

80

100

120100

48.8

10,000

54.642.7

100

48.6

20,000

52.444.1

100

49.5

40,000

53.946.9

Outputsize(polygons)

NormalizedExecutionTime (lower is better)

R-Simp SpatialSort Reorganization Both

Figure8: Blade:NormalizedExecutionTime

19

Page 20: A Case Study of Improving Memory Locality In Polygonal

0

20

40

60

80

100

120100

64

86.4

128

79.5

256

73.4

512

72.4

1024

73.3

2048

73.1

4096Threshold̂ for Reorganization(in pages)

NormalizedExecutionTime(lower is better)

Figure9: Blade:VaryingReorganizationThreshold,NormalizedExecutionTime,40,000polygonoutput

5 Concluding Remarksand Futur e Work

In computergraphicsandvisualization,thecomplexity of themodelsandthesizeof thedatasetshavebeen

increasing.Modern3D scannersandrising standardsfor imagequality have fueleda trendtowardslarger

3D polygonalmodelsandalsointo modelsimplificationalgorithms.Suchalgorithmsreducethenumberof

polygonsin a modelwhile attemptingto maximizethequality of thesimplifiedmodel. By pre-computing

simplified modelsin advance,a modelwith the appropriatelevel of detail for a given usecanbe selected

andeffective trade-offs canbemadebetweenquality andperformance.

R-Simp is a new modelsimplificationalgorithmwith low run times,easycontrol of the outputmodel

size,andgoodmodelquality. However, R-Simp andsimilaralgorithmssuffer from pagingbottleneckswhen

the model is too large to fit in physicalmemoryandthe run-timescanextend into hours. No matterthe

amountof RAM that onecanafford, theremay be a model that is too large to fit in memory. We have

developedspatialsortinganddatastructurereorganizationtechniquesto improve the memorylocality of

R-Simp andexperimentallyshown it to improve performanceby up to 7-fold. We have alsointroducedthe

clusterpagespanmetricasonemeasureof memorylocality in modelsimplification.

Froma systemspoint of view, eventhoughthesimplifiedmodelscanbepre-computed,it is important

that thesimplificationprocessbeasfastaspossible.Often,many differentsimplifiedversionsof thesame

full-sizedmodelmustbepre-computed;themoreversions,themorelikely thatamodelwith acertainlevel

of detailisavailablefor agivendisplaysituation.If simplificationtakestoolong,thedesignerof thegraphics

or visualizationsystemmaycomputefewer versionsof themodel,resultingin a poorertrade-off between

imagequality andrenderingperformancein a given situation. Sincememorylocality andpagingappear

to be first-orderdeterminantsof performancein the simplificationof large models,we have characterized

locality throughmetricsanddevelopedtechniquesto significantlyimprove performancein practice.

20

Page 21: A Case Study of Improving Memory Locality In Polygonal

For futurework, weplanto studymemorylocality in othersimplificationalgorithmsandto determineif

spatialsortingandreorganizationcanalsoimprove their performance.More experimentationandanalysis

is alsorequiredto understandthespecificthresholdsandcriteriausedto selectively reorganize.Finally, we

needto repeatourexperimentsondifferenthardwareplatforms,with differentratiosof physicalmemoryto

modelsize,to betterunderstandthatparameterspace.

References

[1] Dmitry Brodsky and BenjaminWatson. Model simplification throughrefinement. In Sidney Fels and Pierre

Poulin, editors, Graphics Interface ’00, pages221–228.CanadianInformation ProcessingSociety, Canadian

Human-ComputerCommunicationsSociety, May 2000.

[2] Michael GarlandandPaul S. Heckbert. Surfacesimplificationusingquadricerror metrics. In SIGGRAPH 97

Conference Proceedings, AnnualConferenceSeries,pages209–216.ACM SIGGRAPH,AddisonWesley, August

1997.

[3] Paul S.HeckbertandMichaelGarland.Survey of polygonalsurfacesimplificationalgorithms.Technicalreport,

CarnegieMellon University, 1997.Draft Version.

[4] HuguesHoppe,Tony DeRose,TomDuchamp,JohnMcDonald,andWernerStuetzle.Meshoptimization.In SIG-

GRAPH 93 Conference Proceedings, volume27 of Annual Conference Series, pages19–26.ACM SIGGRAPH,

AddisonWesley, August1993.

[5] M. Levoy, K. Pulli, B. Curless,S.Rusinkiewicz,D. Koller, L. Pereira,M. Ginzton,S.Anderson,J.Davis, J.Gins-

berg, J.Shade,andD. Fulk. Thedigital michaelangeloproject:3dscanningof largestatues.In Proceedings ACM

SIGGRAPH 2000, pages131–144.ACM, 2000.

[6] PeterLindstrom. Out-of-coresimplificationof largepolygonalmodels.In Proceedings ACM SIGGRAPH 2000,

pages259–262.ACM, 2000.

[7] JarekRossignacandPaulBorrel. Multi-resolution3D approximationsfor renderingcomplex scenes.In Modeling

in Computer Graphics: Methods and Applications, pages455–465,Berlin, 1993.Springer-Verlag.

[8] William J. Schroeder, JonathanA. Zarge,andWilliam E. Lorensen.Decimationof trianglemeshes.Computer

Graphics, 26(2):65–70,July 1992.

21