
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

Kaibo Wang 1  Yin Huai 1  Rubao Lee 1  Fusheng Wang 2,3  Xiaodong Zhang 1  Joel H. Saltz 2,3

1 Department of Computer Science and Engineering, The Ohio State University
2 Center for Comprehensive Informatics, Emory University
3 Department of Biomedical Informatics, Emory University

1 {wangka,huai,liru,zhang}@cse.ohio-state.edu  2,3 {fusheng.wang,jhsaltz}@emory.edu

ABSTRACT

As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge number of segmented micro-anatomic objects demands extremely intensive data and compute operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory, since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets show the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database system.

1. INTRODUCTION

Digitized pathology images generated by high-resolution scanners enable the microscopic examination of tissue specimens to support clinical diagnosis and biomedical research [9]. With the emerging pathology imaging technology, it is essential to develop and evaluate high-quality image analysis algorithms, with iterative efforts on algorithm validation, consolidation, and parameter sensitivity studies. One essential task to support such work is to provide an efficient tool for cross-comparing millions of spatial boundaries of segmented micro-anatomic objects. A commonly adopted cross-comparing metric is the Jaccard similarity [33], which computes the ratio of the total area of the intersection to the total area of the union of two polygon sets.

Building high-performance cross-comparing tools is challenging, due to the data explosion in pathology imaging analysis, as in other scientific domains [21][26]. Whole-slide images in pathology, made by scanning microscope slides at diagnostic resolution, are very large: a typical image may contain 100,000x100,000 pixels and millions of objects such as cells or nuclei. A study may involve hundreds of images obtained from a large cohort of subjects. For large-scale interrelated analysis, there may be dozens of algorithms – with varying parameters – generating many different result sets to be compared and consolidated. Thus, the derived data from the images of a single study is often on the scale of tens of terabytes, and would be much larger in future clinical environments.

Pathologists mainly rely on spatial database management systems (SDBMSs) to execute spatial cross-comparison [34]. However, cross-comparing a huge number of polygons is time-consuming with SDBMSs, which cannot fully use the rich parallel resources of modern hardware. In the era of high-throughput computing [25], unprecedentedly rich and low-cost parallel computing resources, including GPUs and multi-core CPUs, have become available. In order to use these resources to maximize execution performance, applications have to fully exploit both thread-level and data-level parallelism and make good use of SIMD (Single Instruction Multiple Data) vector units to parallelize workloads [18].

However, supporting spatial cross-comparison on a CPU-GPU hybrid platform imposes two major challenges. First, parallelizing spatial operations, such as computing the areas of polygon intersection and union, on GPUs requires efficient algorithms. Existing CPU algorithms, e.g., those used in SDBMSs, are branch intensive with irregular data access patterns, which makes them very hard, if not impossible, to parallelize on GPUs. Efficient GPU algorithms, if they exist, must successfully exploit the massive data parallelism in the cross-comparing workload and execute it in a SIMD fashion. Second, a GPU-friendly system framework is required to drive the whole spatial cross-comparing workload. The special characteristics of the GPU device require data batching to mitigate communication overhead, and coordinated device sharing to prevent resource contention. Furthermore, due to the diversity of hardware configurations and workloads, task execution has to be balanced between GPU and CPU to maximize resource utilization.

In this paper, we present a customized solution, SCCG (Spatial Cross-Comparison on CPUs and GPUs), to address these challenges. Through detailed profiling, we identify that the bottleneck of cross-comparing query execution is computing the areas of polygon intersection and union. This explains the low performance of SDBMSs and motivates us to design an efficient GPU algorithm, PixelBox, to accelerate this spatial operation. Both the design and the implementation of the algorithm are optimized thoroughly to ensure high performance on GPUs. Moreover, we develop a pipelined system framework for the whole workload, and design a dynamic task migration component to solve the load balancing problem. The pipelined framework has the advantage of naturally supporting data batching and GPU sharing. The task migration component further improves system throughput by balancing tasks between GPU and CPU.

The main contributions of this paper are as follows: 1) PixelBox, an efficient GPU algorithm and optimized implementation for computing the Jaccard similarity of polygon sets; 2) a pipelined framework with task migration for processing the whole workload of spatial cross-comparison on a CPU/GPU hybrid platform; and 3) a demonstration of our solution's performance (an 18X speedup over a parallelized SDBMS) with extensive and thorough experiments using real-world pathology imaging data.

The rest of the paper is organized as follows. Section 2 introduces the background and identifies the problem with SDBMSs in processing spatial cross-comparing queries. A new GPU algorithm, PixelBox, is presented in Section 3 to accelerate the bottleneck spatial operations. Section 4 introduces the pipelined framework and the design of a task migration facility for workload balancing. A comprehensive experimental study is presented in Section 5, followed by related work in Section 6 and conclusions in Section 7.

    2. PROBLEM IDENTIFICATION

2.1 Background: Spatial Cross-Comparison

A critical step in pathology imaging analysis is to extract the spatial locations and boundaries of micro-anatomic objects, which are then represented with polygons, from digital slide images using segmentation algorithms [9]. The effectiveness of a segmentation algorithm depends on a variety of factors, such as the quality of microtome staining machines, staining techniques, peculiarities of tissue structures, and others. A slight change of algorithm parameters may also lead to dramatic variations in segmentation results. As a result, evaluating the effectiveness and sensitivity of segmentation algorithms has become a critically important operation in pathology imaging studies.

The core computation is to cross-compare two sets of polygons, which are segmented by different algorithms or by the same algorithm with different parameters, to obtain their degree of similarity. Jaccard similarity, due to its simplicity and meaningful geometric interpretation, has been widely used in pathology imaging analysis to measure the similarity of two polygon sets.

Suppose P and Q are two sets of polygons representing the spatial boundaries of objects generated by two methods from the same image. Their Jaccard similarity is defined as

J = ‖P ∩ Q‖ / ‖P ∪ Q‖,

where P ∩ Q and P ∪ Q denote the intersection and the union of P and Q, and ‖·‖ is defined as the area of one or multiple polygons in a polygon set. To further simplify computation, researchers in digital pathology use a variant definition of Jaccard similarity: let r(p, q) = ‖p ∩ q‖ / ‖p ∪ q‖; then

J′ = ⟨{r(p, q) : p ∈ P, q ∈ Q, ‖p ∩ q‖ ≠ 0}⟩,   (1)

in which ⟨·⟩ represents the average value of all the elements in a set. The greater the value of J′ is, the more likely P and Q resemble each other. Compared with J, J′ does not consider missing polygons that appear in one polygon set but have no intersecting counterpart in the other. Missing polygons can be easily identified by comparing the number of polygons that appear in the intersection with the number of polygons in each polygon set. Other additional measurements of similarity, such as distance of centroids, are ignored in our discussion, as their computational complexity is low.
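As a concrete illustration of Formula 1, a minimal host-side sketch of how J′ could be accumulated once the per-pair areas are available; the struct and function names are illustrative, not part of the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record for one pair of polygons (p, q) whose MBRs intersect.
struct PairAreas {
    double inter;   // ‖p ∩ q‖
    double uni;     // ‖p ∪ q‖
};

// J' = average of ‖p ∩ q‖ / ‖p ∪ q‖ over all pairs with a non-empty intersection.
double jaccard_variant(const std::vector<PairAreas>& pairs) {
    double sum = 0.0;
    std::size_t n = 0;
    for (const PairAreas& a : pairs) {
        if (a.inter > 0.0) {          // pairs whose polygons do not actually intersect are skipped
            sum += a.inter / a.uni;
            ++n;
        }
    }
    return n > 0 ? sum / n : 0.0;
}
```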

What makes the computation of J′ highly challenging is the number of polygons involved in spatial cross-comparison, which is usually enormous. Due to the high dependability required by medical analysis, the test image base has to be sufficiently large: hundreds of whole-slide images are common, with each image generating millions of polygons to represent the spatial boundaries of objects. Since a single image contains a great number of micro-anatomic objects, the average sizes of the polygons extracted from pathology images are usually very small.

To expedite both segmentation and spatial cross-comparison, large image files are usually pre-partitioned into many small tiles so that they can fit into memory and allow parallel segmentation. The generated polygon files for each whole image also reflect the structure of such partitioning: polygons extracted from a single image tile are contained in a single polygon file; a group of polygon files constitutes the segmentation result for a whole image; different segmentation results for the same image are represented with different groups of polygon files, which are cross-compared with each other for the purpose of algorithm validation or sensitivity studies.

In the rest of the paper, we refer to the area of the intersection of two polygons as the area of intersection, and the area of the union of two polygons as the area of union.

2.2 Existing Solutions with SDBMSs

Pathologists mainly rely on SDBMSs to support spatial cross-comparison [34]. In this way, the workload is expressed as an SQL query with spatial operations. Figure 1(a) shows a cross-comparing query in PostGIS [1] SQL grammar that computes the Jaccard similarity of two polygon sets, named 'oligoastroiii_1_1' and 'oligoastroiii_1_2'. The join condition is expressed with the spatial predicate ST_Intersects, which tests whether two polygons intersect or touch. For each pair of intersecting polygons, their area of intersection, area of union, and thus the ratio of the two areas are computed. The spatial operators ST_Intersection and ST_Union compute the boundaries of the intersection and the union of two polygons, while ST_Area returns the area of one or a group of polygons. Finally, the ratios are averaged to derive the similarity score for the whole image.

For the SDBMS solution, a typical query execution consists of three major steps. First, polygon files (raw data) are loaded into the spatial database; second, spatial indexes are built based on the minimum bounding rectangles (MBRs) of the polygons; finally, spatial queries are executed to compute the Jaccard similarity.

According to the simple formula ‖p ∪ q‖ = ‖p‖ + ‖q‖ − ‖p ∩ q‖, the query can be rewritten so that only the ST_Intersection operator is executed for each pair of intersecting polygons, while the area of union is computed indirectly through the formula. Moreover, ST_Intersects can be removed, since we only need records with ratio > 0, and whether two polygons intersect can be determined by their area of intersection. By replacing ST_Intersects with the && operator, which tests whether the MBRs of two polygons intersect, we can further optimize the query, as shown in Figure 1(b).

(a) A cross-comparing query without optimizations.

(b) A cross-comparing query with optimizations.

Figure 1: Cross-comparing queries for the Jaccard similarity of two polygon sets extracted from the same image.

2.3 Performance Profiling of SDBMS Solutions

To identify the performance bottleneck of cross-comparing queries in SDBMSs, we performed a set of experiments with PostGIS, a popular open-source SDBMS.1,2 We used a real-world dataset extracted from a brain tumor slide image. The total size of the dataset in raw text format is about 750 MiB, with two sets of polygons (representing tumor nuclei) each containing over 450,000 polygons, and over 570,000 pairs of polygons with MBR intersections. Details of the platform and the dataset will be described in Section 5.

We split the query execution into separate components, and profiled the time spent by the query engine on each component during a single-core execution. The result is presented in Figure 2 for both the unoptimized and optimized queries. 'Index Search' refers to the testing of MBR intersections based on the built spatial indexes. 'Area Of Intersection' and 'Area Of Union' represent computing the areas of intersection and union, which correspond to the two combined operators, ST_Area(ST_Intersection()) and ST_Area(ST_Union()). 'ST_Area' denotes the other two stand-alone ST_Area operators in the optimized query.

For the query without optimizations, ST_Intersects (21.8%), Area Of Intersection (37.4%), and Area Of Union (36.7%) take the highest percentages of execution time, representing the bottlenecks of the query execution. For the optimized query, since ST_Intersects and Area Of Union are removed from the SQL statement, Area Of Intersection becomes the sole performance bottleneck, capturing almost 90% of the total query execution time. As the bottom two bars show, very little time (less than 6%) was spent on index building and index search in both queries. The bar for ST_Area indicates that the time to compute polygon areas is negligible, which further indicates that the high overhead of Area Of Intersection and Area Of Union comes from the spatial operators ST_Intersection and ST_Union.

The profiling result explains the low performance of SDBMSs in supporting cross-comparing queries: computing the intersection/union of polygons is too time-consuming when the number of polygon pairs is large. SDBMSs usually rely on geometric computation libraries, for example, GEOS [3] in PostGIS, to implement spatial operators. Designed to be general-purpose, the algorithms used by these libraries to compute the intersection and union of two polygons are compute-intensive and very difficult to parallelize. We analyzed the source code of the respective functions for computing polygon intersection and union in GEOS and in another popular geometric computation library, CGAL [4], and found that only a very small portion of the code can be parallelized without significantly changing the algorithm structures. Both GEOS and CGAL use generic sweepline algorithms [10], which are not built for computationally intensive queries and thus lead to limited performance in SDBMSs.

Figure 2: Execution time decomposition of cross-comparing queries in PostGIS on a single core.

1 We also performed similar experiments on a mainstream commercial SDBMS, but its performance was much worse. For simplicity, we only present the results for PostGIS.
2 Based on our communication with the community of SciDB [2], spatial cross-comparing queries are not natively supported by SciDB.

Using a large compute cluster can improve system performance. However, unlike in many high-performance computing environments, pathologists can barely afford expensive facilities in real clinical settings [23]. A cost-effective yet highly productive solution is thus highly desirable. This motivates us to design a customized solution to accelerate large-scale spatial cross-comparison. To eliminate the performance bottleneck, our solution needs an efficient GPU algorithm for computing the areas of intersection and union, as will be introduced next.

3. THE PIXELBOX ALGORITHM

We describe a GPU algorithm, PixelBox, to compute the areas of intersection and union. The design of PixelBox mainly solves three problems: 1) how to parallelize the computation of the area of intersection and the area of union on GPUs, 2) how to reduce compute intensity when polygon pairs are relatively large, and 3) how to implement the algorithm efficiently on GPUs. We use the terms defined in NVIDIA CUDA [5] to describe PixelBox. However, the algorithm design is general and applicable to other GPU architectures.

3.1 Pixelization of Polygon Pairs

As measured in the previous section, computing the exact boundaries of polygon intersection/union incurs enormous overhead and has been the main cause of the low performance of SDBMSs in processing cross-comparing queries. However, the most relevant quantities in the definition of Jaccard similarity (as shown in Formula 1) are the areas, not the intermediate boundaries. As a key to enabling parallelization on GPUs, PixelBox directly computes the areas without resorting to the exact forms of the intersection or union.

Polygons extracted from medical images share a common property: the coordinates of vertices are integer-valued, and the directions of edges are either horizontal or vertical. Such polygons are a special form of rectilinear polygons [32]. As illustrated in Figure 3, since medical images are made of pixels, the boundary of a segmented polygon follows the regular, squared grid lines defined by the unit-sized pixels.

Figure 3: Polygons extracted from medical images have axis-aligned edges and integer-valued vertices.

Taking advantage of this property, PixelBox treats a polygon as a continuous region surrounded by its spatial boundary on a pixel map. As shown in Figure 4(a), pixels within the MBR of polygons p and q can be classified into three categories: 1) pixels (e.g., A) lying inside both p and q, 2) pixels (e.g., B and C) lying inside one polygon but not the other, and 3) pixels (e.g., D) lying outside both. The area of intersection (‖p ∩ q‖) can be measured by the number of pixels belonging to the first category. The area of union (‖p ∪ q‖) corresponds to the number of pixels in the first and second categories. Finally, pixels in the third category do not contribute to either ‖p ∩ q‖ or ‖p ∪ q‖. The pixelized view of polygon intersection and union averts the hassle of computing boundaries and, more importantly, exposes a great opportunity for exploiting the fine-grained data parallelism hidden in the cross-comparing computation.

In order to determine a pixel's position relative to a polygon, a well-known method is to cast a ray from the pixel and count the number of intersections with the polygon's boundary [27]. As illustrated in Figure 4(b), if the number is odd, the pixel (e.g., A) lies inside the polygon; if the number is even, the pixel (e.g., B) lies outside.

The pixelization method is very suitable for execution on GPUs. Since testing the position of one pixel is totally independent of another, we can parallelize the computation by having multiple threads process the pixels in parallel. Moreover, since the positions of different pixels are computed against the same pair of polygons, the operations performed by different threads follow the SIMD fashion, which is required by GPUs. Finally, the area of intersection and the area of union can be computed together during a single traversal of all pixels with almost no extra overhead, because the criteria for testing intersection (which uses a boolean AND operation) and union (which uses a boolean OR operation) are both based on each pixel's positions relative to the same polygon pair. As the number of input polygon pairs is large, we can delegate them to multiple thread blocks. For each polygon pair, the contributions of the pixels in the MBR to the area of intersection and the area of union can be processed by all the threads of a thread block in parallel.
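To make the ray-casting test concrete, a hedged CUDA device-function sketch follows; it treats pixel (px, py) as the unit square centered at (px + 0.5, py + 0.5) and assumes the vertices of the rectilinear polygon are given in order. The signature and vertex layout are illustrative assumptions, not the paper's actual implementation, which additionally stages vertices in shared memory.

```cuda
// Position of a pixel relative to a rectilinear polygon by ray casting:
// shoot a ray in the +x direction from the pixel center and count crossings
// with vertical polygon edges. verts holds n integer vertices in order;
// edge i runs from verts[i] to verts[(i + 1) % n].
__device__ bool pixel_in_poly(int px, int py, const int2 *verts, int n)
{
    int crossings = 0;
    for (int i = 0; i < n; ++i) {
        int2 a = verts[i];
        int2 b = verts[(i + 1) % n];
        if (a.x == b.x && a.x > px) {              // vertical edge right of the pixel center
            int ylo = min(a.y, b.y), yhi = max(a.y, b.y);
            if (py >= ylo && py < yhi)             // half-open span avoids double-counting vertices
                ++crossings;
        }
    }
    return (crossings & 1) != 0;                   // odd number of crossings => inside
}
```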

(a) A pixelized view of polygon intersection and union.

(b) Determination of a pixel's position relative to a polygon.

(c) Using sampling boxes to reduce compute intensity.

(d) PixelBox combines the pixelization and sampling-box approaches.

Figure 4: The principles of PixelBox.

3.2 Reduction of Compute Intensity

The pixelization method described above has a weakness: the compute intensity rises quickly as the number of pixels contained in the MBR increases. Even though polygons are usually very small in pathology imaging applications, as the resolution of scanner lenses increases, the sizes of polygons may also increase accordingly to capture more details of the objects. There are also cases where the areas of intersection and union are computed between a small group of relatively large polygons and a large number of small polygons, e.g., when processing an image with a few capillary vessels surrounded by many cells, because the MBR of a relatively large polygon covers a wide area. Moreover, as will be verified in Section 5, even when polygons are small, it is still possible to further bring down the compute intensity and improve algorithm performance.

To reduce the intensity of computation and make the algorithm more scalable, PixelBox utilizes another technique, called sampling boxes, whose idea is similar to the adaptive mesh refinement method [8] used in numerical analysis. Due to the continuity of the interior of a polygon, the positions of pixels have spatial locality: if one pixel lies inside (or outside) a polygon, other pixels in its neighborhood are likely to lie inside (or outside) too, with exceptions near the polygon's boundary. Exploiting this property, we calculate the areas of intersection and union region by region, instead of pixel by pixel, so that the contribution of all pixels in a region may be computed at once.

This technique is visualized in Figure 4(c). The MBR of a polygon pair is recursively partitioned into sampling boxes, which are first created at a coarser granularity (see the large grid cells in the figure) and then refined at selected sub-regions (e.g., as shown by the small boxes near the top) which need further exploration. For example, when computing the area of intersection, if a sampling box lies completely inside both polygons, the contribution of all pixels within the sampling box to the area of intersection is obtained at once, which equals the size of the sampling box; otherwise, the sampling box is partitioned into smaller sub-sampling boxes, with each sub-sampling box to be tested further. In Figure 4(c), the gray sampling boxes do not need to be further partitioned because their contributions to the areas of intersection and union are already determined.

Figure 5: A sampling box's position relative to a polygon: (a) outside; (b) inside; (c, d) hover.

Similar to the pixelization method, a critical step of the sampling-box approach is to determine a sampling box's position relative to a polygon, which has three possible values: 1) inside, where every pixel in the sampling box lies inside the polygon; 2) outside, where every pixel in the sampling box lies outside the polygon; and 3) hover, where some pixels lie inside while others lie outside the polygon.

Lemma 1. A sampling box's position relative to a polygon is determined by three conditions: (i) none of the sampling box's four edges crosses through the polygon's boundary; (ii) none of the polygon's vertices lies inside the sampling box; (iii) the sampling box's geometric center lies inside the polygon.

The sampling box lies inside the polygon when all three conditions are true; it lies outside the polygon when the first two conditions are true but the last condition is false; it hovers over the polygon in all other cases, that is, when condition (i) or (ii) is false.

Lemma 1 gives the criteria for computing a sampling box's position, which are further illustrated in Figure 5. For each sampling box, its four edges are tested against the polygon's boundary. If there are edge-to-edge crossings, the sampling box must hover over the polygon (case (d) in Figure 5). Otherwise, if any vertex of the polygon lies inside the sampling box, the entire polygon must be contained in the sampling box due to the continuity of its boundary, in which case the position is also hover (case (c) in Figure 5); if none of the polygon's vertices is inside the sampling box, the sampling box may be either totally inside the polygon (case (b) in Figure 5) or totally outside the polygon (case (a) in Figure 5), in which case the position of the sampling box's geometric center makes the distinction. If the sampling box's four edges overlap with the polygon's boundary, the sampling box's position can be considered as either inside or outside; the next partition will distinguish the contribution of each sub-sampling box to the areas of intersection and union.
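A compact device-side sketch of how Lemma 1 might be applied to one rectilinear polygon is shown below. It reuses pixel_in_poly() from the earlier ray-casting sketch for condition (iii); the box representation and the simplified treatment of degenerate overlaps are assumptions, not the paper's code.

```cuda
enum BoxPos { BOX_OUTSIDE, BOX_INSIDE, BOX_HOVER };

// Sampling box [x0, x1] x [y0, y1] tested against a rectilinear polygon, per Lemma 1.
__device__ BoxPos box_position(int x0, int y0, int x1, int y1,
                               const int2 *verts, int n)
{
    for (int i = 0; i < n; ++i) {
        int2 a = verts[i], b = verts[(i + 1) % n];

        // Condition (i): a polygon edge crossing through a box edge means "hover".
        if (a.x == b.x) {                                       // vertical polygon edge
            int ylo = min(a.y, b.y), yhi = max(a.y, b.y);
            if (a.x > x0 && a.x < x1 &&
                ((y0 > ylo && y0 < yhi) || (y1 > ylo && y1 < yhi)))
                return BOX_HOVER;
        } else {                                                // horizontal polygon edge
            int xlo = min(a.x, b.x), xhi = max(a.x, b.x);
            if (a.y > y0 && a.y < y1 &&
                ((x0 > xlo && x0 < xhi) || (x1 > xlo && x1 < xhi)))
                return BOX_HOVER;
        }

        // Condition (ii): a polygon vertex strictly inside the box also means "hover".
        if (a.x > x0 && a.x < x1 && a.y > y0 && a.y < y1)
            return BOX_HOVER;
    }

    // Condition (iii): the box is entirely inside or outside; its center decides which.
    return pixel_in_poly((x0 + x1) / 2, (y0 + y1) / 2, verts, n)
               ? BOX_INSIDE : BOX_OUTSIDE;
}
```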

Testing the position of a sampling box is more costly than doing so for a pixel. When the granularity of a sampling box is large, the extra overhead is compensated by the amount of per-pixel computation saved. However, as sampling boxes become more fine-grained, the cost of computing their positions becomes more significant. Moreover, applying sampling boxes requires synchronization between cooperating threads: the examination of one sampling box cannot start until all threads have finished the partitioning of its parent box. Frequent synchronization leads to low utilization of computing resources and has been one of the main hazards to performance improvement on GPUs [35].

To retain the merits of both efficient data parallelization and low compute intensity, PixelBox combines pixelization with the sampling-box technique. As depicted in Figure 4(d), sampling boxes are applied first to quickly finish the testing for a large number of regions; when the size of a sampling box becomes smaller than a threshold, T, the pixelization method takes over and finishes the rest of the computation.

Unlike the pure pixelization method, computing the area of intersection and the area of union together incurs extra overhead with sampling boxes. For example, if a sampling box hovers over one polygon but lies outside the other, its contribution to the area of intersection is clear, but its contribution to the area of union is not; in this case, the sampling box has to be partitioned further until the area of union is settled or the pixelization threshold is reached. To reduce the number of sampling box partitionings applied and further improve algorithm performance, the area of union is not computed together with the area of intersection in PixelBox. Instead, similar to the query optimization in Figure 1(b), we compute the areas of the polygons, and use the formula ‖p ∪ q‖ = ‖p‖ + ‖q‖ − ‖p ∩ q‖ to derive the area of union. Computing the area of a simple polygon is very easy to implement on GPUs. With the formula3

A = (1/2) Σ_{i=0}^{n−1} (x_i y_{i+1} − x_{i+1} y_i),

in which (x_i, y_i) is the coordinate of the i-th vertex of the polygon, we can let different threads compute different vertices and sum up the partial results to get the area.
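A hedged sketch of the per-thread shoelace accumulation corresponding to PolyArea in Algorithm 1: each thread handles a strided subset of vertices, and the per-thread partials are reduced later on the CPU (where the 1/2 factor and absolute value are applied). The exact signature is an assumption for illustration.

```cuda
// Per-thread partial sum of the shoelace terms x_i*y_{i+1} - x_{i+1}*y_i over a
// strided subset of vertices. The caller stores the result into A[i][tid] as in
// Algorithm 1; the final area is obtained after a CPU-side reduction.
__device__ float poly_area_partial(const int2 *verts, int n, int tid, int nthreads)
{
    float partial = 0.0f;
    for (int i = tid; i < n; i += nthreads) {
        int2 a = verts[i];
        int2 b = verts[(i + 1) % n];
        partial += (float)a.x * b.y - (float)b.x * a.y;
    }
    return partial;
}
```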

3.3 Optimized Algorithm Implementation

Algorithm 1 shows the pseudocode of PixelBox. Sampling boxes are created and examined recursively: one region is probed from coarser to finer granularities before the next one. A shared stack is used to store the coordinates of the sampling boxes and whether each sampling box needs to be further partitioned. For each polygon pair allocated to a thread block, its MBR is pushed to the stack as the first sampling box (line 13). All threads keep popping the sampling box on the top of the stack for examination (line 18). If the sampling box does not need to be further probed, the threads continue to pop the next sampling box (lines 19 - 20) until the stack becomes empty and the computation for the polygon pair finishes. For a sampling box that needs to be further probed, if its size is smaller than the threshold T, the pixelization procedure is applied (lines 23 - 28); otherwise, it is partitioned into sub-sampling boxes; after further examination, new sampling boxes are pushed to the stack by all threads simultaneously (lines 30 - 39).

In the algorithm, PolyArea(i) computes the partial area of a polygon handled by thread i; BoxSize returns the number of pixels contained in a sampling box; PixelInPoly(m, i, p) computes the position of the i-th pixel in sampling box m relative to polygon p; SubSampBox(b, i) partitions a sampling box b and returns the i-th sub-box for a thread to process; BoxPosition(b, p) computes the position of sampling box b relative to polygon p; BoxContinue computes whether a sampling box needs to be further partitioned based on its positions relative to the two polygons; and BoxContribute computes whether a sampling box contributes to the area of intersection according to its position.

The use of a stack to store sampling boxes saves a lot of memory space, and allows both the testing of sampling box positions and the generation of new sampling boxes to be parallelized. A synchronization is required before popping a sampling box (line 17) to ensure that thread 0 or the last thread in the thread block has pushed the sampling box to the top of the stack. When threads push new sampling boxes onto the stack, they do not overwrite the old stack top (line 37); otherwise, an extra synchronization would be required before pushing new sampling boxes to ensure that the old stack top has been read by all threads. In the current design, the old stack top is marked as 'not to be further probed', and will be skipped by all threads when it is popped from the stack again.

3 See http://en.wikipedia.org/wiki/Polygon#Area_and_centroid

Algorithm 1 The PixelBox GPU algorithm.

1:  {pi, qi}i : the array of input polygon pairs
2:  {mi}i : the MBR of each polygon pair
3:  N : total number of polygon pairs
4:  stack[] : the shared stack containing sampling boxes
5:  I[N][blockDim.x] : partial areas of intersections
6:  A[N][blockDim.x] : partial summed areas of polygons
7:
8:  procedure Kernel_SampBox
9:    tid ← threadIdx.x
10:   for i = blockIdx.x to N do
11:     A[i][tid] ← A[i][tid] + PolyArea(pi)
12:     A[i][tid] ← A[i][tid] + PolyArea(qi)
13:     Thread 0: stack[0] ← {mi, 1}
14:     top ← 1
15:     while top > 0 do
16:       top ← top − 1
17:       syncthreads()
18:       {box, c} ← stack[top]
19:       if c = 0 then
20:         continue
21:       else
22:         if BoxSize(box) < T then
23:           for j ← tid to BoxSize(box) do
24:             φ1 ← PixelInPoly(box, j, pi)
25:             φ2 ← PixelInPoly(box, j, qi)
26:             I[i][tid] ← I[i][tid] + (φ1 ∧ φ2)
27:             j ← j + blockDim.x
28:           end for
29:         else
30:           subbox ← SubSampBox(box, tid)
31:           φ1 ← BoxPosition(box, pi)
32:           φ2 ← BoxPosition(box, qi)
33:           c ← BoxContinue(φ1, φ2)
34:           t ← BoxContribute(φ1, φ2)
35:           a ← (1 − c) × t × BoxSize(subbox)
36:           I[i][tid] ← I[i][tid] + a
37:           stack[top + 1 + tid] ← {subbox, c}
38:           Thread 0: stack[top].c ← 0
39:           top ← top + 1 + blockDim.x
40:         end if
41:       end if
42:     end while
43:     i ← i + gridDim.x
44:   end for
45: end procedure

The GPU kernel only computes the partial areas of intersection and the partial summed areas of polygons accumulated per thread (lines 5 - 6), which are reduced later on the CPU to derive the final areas of intersection and union. The reduction is not performed on the GPU because the number of partial values for each polygon pair is relatively small (equal to the thread block size), which makes it not very efficient to execute on the GPU. We measured the time taken by the reductions on a CPU core; the cost is negligible compared to other operations on the GPU.
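A sketch of the CPU-side reduction this paragraph describes; the array shapes follow Algorithm 1, and it is assumed here that A[i][*] already holds per-thread pieces of ‖p_i‖ + ‖q_i‖, so the area of union follows from ‖p ∪ q‖ = ‖p‖ + ‖q‖ − ‖p ∩ q‖. Names are illustrative.

```cpp
#include <cstddef>

// Reduce the per-thread partials written by the kernel for one polygon pair
// (Algorithm 1, lines 5 - 6) into the final areas of intersection and union.
void reduce_pair(const float *I_row, const float *A_row, std::size_t block_dim,
                 double *area_inter, double *area_union)
{
    double inter = 0.0, polys = 0.0;
    for (std::size_t t = 0; t < block_dim; ++t) {
        inter += I_row[t];     // partial areas of intersection
        polys += A_row[t];     // partial summed areas of the two polygons
    }
    *area_inter = inter;
    *area_union = polys - inter;   // ||p|| + ||q|| - ||p ∩ q||
}
```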

In the rest of this subsection, we explain some optimizations employed in the algorithm implementation.

Utilize shared memory. Effectively using shared memory has been a key to improving program performance on GPUs [30]. The sampling box stack is frequently read and modified by all threads in a thread block, and thus should be allocated in shared memory. Meanwhile, the polygon vertex data is also repeatedly accessed by all threads when computing the positions of pixels or sampling boxes. Loading vertices into shared memory reduces global memory accesses and thus improves performance. Due to the limited size of shared memory, it is infeasible to allocate for the largest possible vertex array. As a trade-off, we set a static size for the shared memory region containing polygon vertices, and only those polygons whose number of vertices fits into the region are loaded.

Avoid memory bank conflicts. Bank conflicts happen when threads in a warp try to access different data items residing in the same memory bank simultaneously. In this case, memory access is serialized, which decreases bandwidth utilization. In the sampling-box procedure, pushing new sampling boxes onto the stack may incur bank conflicts if the information of each sampling box occupies a contiguous region in the shared stack. This problem can be solved by separating the stack into five independent ones: four sub-stacks store the coordinates of the sampling boxes, and the fifth one stores whether a sampling box needs to be further probed.
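A minimal sketch of the structure-of-arrays layout this describes, inside a kernel skeleton; MAX_STACK and the field split are assumptions for illustration rather than the paper's actual declarations.

```cuda
#define MAX_STACK 512   // assumed per-block stack capacity

__global__ void pixelbox_kernel_skeleton(/* polygon-pair arguments omitted */)
{
    // Five parallel shared-memory arrays instead of one array of five-field records:
    // when all threads push their sub-boxes at stack[top + 1 + tid], each field is
    // accessed with a unit stride across consecutive threads.
    __shared__ int s_x0[MAX_STACK];   // left edges of sampling boxes
    __shared__ int s_y0[MAX_STACK];   // bottom edges
    __shared__ int s_x1[MAX_STACK];   // right edges
    __shared__ int s_y1[MAX_STACK];   // top edges
    __shared__ int s_cont[MAX_STACK]; // whether a box needs further probing

    // ... sampling-box loop as in Algorithm 1 ...
}
```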

Perform loop unrolling. Computing pixel or sampling box positions requires comparing against polygon edges in a loop. Unrolling the loop so that multiple polygon edges are tested in a single iteration reduces the number of branch instructions and hides memory latencies more efficiently.
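For instance, the edge loop of the ray-casting sketch above could be unrolled as follows; the unroll factor and the use of a pragma (rather than manual unrolling) are illustrative choices.

```cuda
// Crossing count with the edge loop unrolled by the compiler: several edges are
// tested per iteration, reducing loop-branch overhead and overlapping vertex loads.
__device__ int count_crossings_unrolled(int px, int py, const int2 *verts, int n)
{
    int crossings = 0;
    #pragma unroll 4
    for (int i = 0; i < n; ++i) {
        int2 a = verts[i];
        int2 b = verts[(i + 1) % n];
        if (a.x == b.x && a.x > px) {
            int ylo = min(a.y, b.y), yhi = max(a.y, b.y);
            crossings += (py >= ylo && py < yhi);
        }
    }
    return crossings;
}
```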

3.4 Related Discussions

Pixelization threshold T. The value of the pixelization threshold has an impact on the performance of PixelBox. Let the number of threads in a thread block be n; an appropriate value of T should be chosen between n and n². If T < n, the number of pixels contained in the last sampling box is less than the number of threads, which will not keep all threads busy during the pixelization procedure; if T > n², the last sampling box contains too many pixels, because it could have been further partitioned at least once while still keeping all threads busy during the pixelization procedure. According to our testing (see Section 5.4), T = n²/2 is a very good choice.

The accuracy of the algorithm. Pixelizing polygons may introduce errors into the computed areas. In the general sense, the finer the granularity of pixels, the more accurate the computed result. For pathology imaging analysis, however, PixelBox does not cause any loss of precision. As explained in Section 3.1, the computed areas equal the numbers of pixels actually lying inside the intersection/union of the polygons on the original image. This property generalizes to polygons segmented from any pixel image in medical imaging and other applications. We validated the correctness of PixelBox by comparing the areas computed by PixelBox with those computed by PostGIS, and found that the results are the same. We regard the generalization of PixelBox to vectorized polygons as future work.

Implications of PixelBox for other spatial operators. The principal ideas of PixelBox can also be used to accelerate other compute-intensive spatial operators on GPUs. For example, ST_Contains can be implemented by computing the area of intersection and testing whether it equals the area of the object being contained. ST_Touches can be accelerated using ideas similar to PixelBox: compare the edges of one polygon with the edges of the other, and also test the positions of the vertices of one polygon relative to the edges of the other; if there is no edge-to-edge crossing and at least one vertex of one polygon lies on an edge of the other, the two polygons touch each other; otherwise, they do not touch. We believe that many frequently used spatial operators in SDBMSs can be parallelized on GPUs by either directly utilizing the PixelBox algorithm or using approaches similar to PixelBox. This is another interesting topic we would like to explore in the future.
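The ST_Contains idea in this paragraph reduces to a one-line test on quantities PixelBox already produces; a hedged sketch (names illustrative):

```cpp
// p contains q exactly when their intersection covers all of q, i.e. ||p ∩ q|| == ||q||.
// Both areas come from the PixelBox pass and the polygon-area computation.
inline bool st_contains_by_area(long long area_inter, long long area_q)
{
    return area_inter == area_q;
}
```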

4. SYSTEM FRAMEWORK

Having presented our core GPU algorithm for computing the areas of intersection and union, we are now in a position to introduce how the whole workflow for spatial cross-comparison is implemented and optimized in a CPU/GPU hybrid environment. From the input of the raw text data for polygons to the output of the final results, the workflow consists of multiple logical stages. To fully exploit the rich resources of the underlying CPU/GPU hardware, these stages must be executed in a controllable and dynamically adaptable way. To achieve this goal, the system framework must address the following three challenges:

1. Data batching: Since the GPU has a memory space separate from the CPU's, input data batching for the GPU is needed to compensate for the long latency of host-device communication, considering the variation of data sizes and polygon counts across different images or tiles.

2. GPU sharing: The GPU is an exclusive, non-preemptive compute device [20], and has only limited register and shared memory capacity in each multiprocessor. Therefore, uncontrolled kernel invocations for polygon pairs from different images/tiles could cause resource contention and low execution efficiency on the GPU.

3. CPU/GPU balancing: For different machine configurations, the whole workflow must balance the tasks executed on the CPU and the GPU, because 1) the CPU's powerful computing resources should not be wasted, and 2) our PixelBox algorithm can effectively remove the performance bottleneck of cross-comparison.

In this section, we present our system framework solution. We first introduce our pipelined structure for the whole workload, and then present our dynamic task migration mechanism between CPU and GPU.

4.1 The Pipelined Structure

We have designed and implemented a pipelined structure for the whole workload. Through inter-stage buffers, or latches, task production and consumption are overlapped to improve resource utilization and system throughput. As depicted in Figure 6, the spatial cross-comparing pipeline comprises four stages:

1. The parser loads polygon files and transforms the polygons from text to binary format. This stage executes on CPUs with multiple worker threads.

2. The builder builds spatial indexes on the transformed polygon data. Since polygons are small, a Hilbert R-Tree [19] is used to accelerate index building. This stage executes on CPUs in a single thread, because its execution speed is already very fast.

3. The filter performs a pairwise index search on the polygons parsed from each pair of polygon files, and generates an array of polygon pairs with intersecting MBRs. Similar to the builder, this stage also executes on CPUs with a single worker thread.

4. The aggregator computes the areas of intersection and union for each polygon array, using our PixelBox algorithm. The ratios of the area of intersection to the area of union are then aggregated to derive the Jaccard similarity value for a whole image. Polygon pairs that do not actually intersect, i.e., with an area of intersection equal to zero, are not considered.

A computation task at each pipeline stage is defined at the image tile scale. For example, an input task for the parser is to parse two polygon files segmented from the same image tile; an input task for the builder is to build indexes on the two sets of polygons parsed by a single parser task. In practice, a digital slide image may contain hundreds of small image tiles; each tile may contain thousands of polygons. The granularity of tasks defined at the image tile level matches the image segmentation procedure, and allows the workload to propagate through the pipeline in a balanced way.

Utilizing such a pipelined framework is critical to solving the aforementioned challenges. First, the work buffers between pipeline stages provide natural support for GPU input data batching. For example, since the number of polygon pairs filtered may be drastically different from tile to tile, it is necessary for the aggregator to group multiple small tasks in its input buffer and send them in a batch to the GPU at once. Second, with a pipelined framework, a single instance of the aggregator consolidates all kernel invocations to the GPUs, which greatly reduces unnecessary contention and makes execution more efficient. Finally, the pipelined framework creates a convenient environment for load balancing between CPU and GPU, as will be introduced next.
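A minimal sketch of the kind of bounded inter-stage work buffer such a pipeline can be built on. The paper's framework is developed with Pthreads; the condition-variable queue below is an illustrative structure, not the actual implementation, and the capacity is an arbitrary choice of the caller.

```cpp
#include <pthread.h>
#include <queue>
#include <utility>

// Bounded work buffer between two pipeline stages: producers block when it is full,
// consumers block when it is empty. A full or empty aggregator buffer is also the
// signal the task migrator uses to detect GPU congestion or idleness (Section 4.2).
template <typename Task>
class StageBuffer {
public:
    explicit StageBuffer(size_t cap) : cap_(cap) {
        pthread_mutex_init(&mu_, nullptr);
        pthread_cond_init(&not_full_, nullptr);
        pthread_cond_init(&not_empty_, nullptr);
    }
    void put(Task t) {
        pthread_mutex_lock(&mu_);
        while (q_.size() >= cap_) pthread_cond_wait(&not_full_, &mu_);
        q_.push(std::move(t));
        pthread_cond_signal(&not_empty_);
        pthread_mutex_unlock(&mu_);
    }
    Task get() {
        pthread_mutex_lock(&mu_);
        while (q_.empty()) pthread_cond_wait(&not_empty_, &mu_);
        Task t = std::move(q_.front());
        q_.pop();
        pthread_cond_signal(&not_full_);
        pthread_mutex_unlock(&mu_);
        return t;
    }
private:
    size_t cap_;
    std::queue<Task> q_;
    pthread_mutex_t mu_;
    pthread_cond_t not_full_, not_empty_;
};
```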

4.2 Dynamic Task Migration

Based on the pipelined structure, we have built a task migration component for the whole workflow to achieve load balancing between CPUs and GPUs. First, we have ported the PixelBox algorithm to CPUs (called PixelBox-CPU), which is straightforward since CPUs are more flexible with program execution patterns, and parallelized its execution with multiple threads. Second, we have also designed and implemented a GPU kernel for the parser stage (called GPU-Parser), whose performance is only comparable to its CPU counterpart, since text parsing requires implementing a finite state machine, which has been shown to be inefficient for parallel execution [7]. In this way, the parser and the aggregator stages are able to execute tasks on both CPUs and GPUs, which creates an opportunity to balance workload distribution through dynamic task migration.

What must be noted is that the task migration relies on a special feature of the pipelined framework to detect workload imbalance at the application level. The work buffers between pipeline stages give a useful indication of the progress of computation and the status of the compute devices. Specifically, if the input buffer of the aggregator stage becomes full, the migrator knows that this stage is making slow progress and the GPUs are congested. On the other hand, if the input buffer of the aggregator stage becomes empty, it indicates that the GPUs are being under-utilized. In each case, tasks are dynamically migrated from GPUs to CPUs, or from CPUs to GPUs, to mitigate resource under-utilization and improve system throughput.

Figure 6: A cross-comparing pipeline with dynamic task migration.

To implement the task migration scheme, two threads, called migration threads, are spawned: one for the aggregator stage and one for the parser stage. They usually stay in the sleeping state and are only woken up when the input buffer of the aggregator stage becomes full or empty. In the case of GPU congestion, the aggregator's migration thread is woken up; it selects the smallest tasks from the input buffer of the aggregator and invokes PixelBox-CPU to execute them. In the case of GPU idleness, the parser's migration thread is woken up to fetch some tasks from the parser's input buffer and execute them on GPUs. The design of the task migration component is illustrated in Figure 6.
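A hedged sketch of the aggregator-side migration thread: it sleeps until the framework signals that the aggregator's input buffer has filled up (GPU congestion), then drains a few small tasks and runs them with the CPU port of PixelBox. The context structure and the two callbacks are illustrative glue, not the paper's API.

```cpp
#include <pthread.h>

struct MigratorCtx {
    pthread_mutex_t mu;
    pthread_cond_t  wake;                      // signaled on buffer-full or shutdown
    bool            buffer_full;
    bool            shutdown;
    bool (*steal_smallest_task)(void **task);  // take a small task from the aggregator buffer
    void (*run_on_cpu)(void *task);            // execute it with PixelBox-CPU
};

void *aggregator_migration_thread(void *arg)
{
    MigratorCtx *ctx = static_cast<MigratorCtx *>(arg);
    for (;;) {
        pthread_mutex_lock(&ctx->mu);
        while (!ctx->buffer_full && !ctx->shutdown)
            pthread_cond_wait(&ctx->wake, &ctx->mu);   // sleep until congestion is signaled
        bool stop = ctx->shutdown;
        ctx->buffer_full = false;
        pthread_mutex_unlock(&ctx->mu);
        if (stop) return nullptr;

        void *task;
        while (ctx->steal_smallest_task(&task))        // offload until the pressure drops
            ctx->run_on_cpu(task);
    }
}
```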

5. EXPERIMENTS

This section evaluates our SCCG solution, including the PixelBox algorithm and the system framework. We have implemented PixelBox and GPU-Parser with NVIDIA CUDA C 4.0. Intel Threading Building Blocks [24], a popular work-stealing software library for task parallelization on CPUs, is used to parallelize text parsing and PixelBox-CPU. The pipelined framework is developed using Pthreads. The dynamic task migration component is built into the execution pipeline, and can be turned on or off according to the requirements of different experiments.

5.1 Experiment Methodology

We perform experiments on two platforms. One is a Dell Precision T1500 workstation with an Intel Core i7 860 2.80 GHz CPU (4 cores), an NVIDIA GeForce GTX 580 GPU card, and 8 GiB of main memory. The operating system is 64-bit Red Hat Enterprise Linux 6 with the 2.6.32 kernel. We primarily use this platform to test the performance of PixelBox and the pipelined framework, and to measure the overall performance of SCCG in cross-comparing all data sets. The effectiveness of the dynamic task migration component in offloading workloads from CPUs to GPUs is also validated on this platform. The other platform is an Amazon EC2 instance with two Intel Xeon X5570 2.93 GHz CPUs (8 cores and 16 threads in total) and two NVIDIA Tesla M2050 GPUs. The size of the main memory is 22 GiB. The operating system is 64-bit CentOS with the 2.6.18 Linux kernel. We primarily use this platform to test the task migration component and to measure the performance of a parallel PostGIS solution cross-comparing the whole data sets. The version of PostGIS (PostgreSQL) we used is 9.1.3.

Our experiments use 18 real-world data sets extracted from 18 digital pathology images used in brain tumor research at the authors' institution. The total size of the data sets in raw text format is about 12 GiB. The average size of the polygons is about 150 pixels, with a standard deviation of around 100. The average number of polygons in each data set is about half a million, with the largest data set containing over 2 million.

In all experiments performed in this paper, we do not consider data loading or disk I/O time, for the purpose of fair comparison. First, it is well known that a database system has high loading overhead when processing one-pass data with the "first-load-then-query" data processing model. SCCG averts this problem through customized text parsing and pipelined execution to process the polygon stream on the fly. Second, disk I/O, even though still a significant performance factor for SDBMSs, is no longer the most severe bottleneck for cross-comparing queries in SDBMSs; most time is spent on computation. The effect of disk I/O can be further mitigated through SCCG's pipelined framework by adding a disk pre-fetcher in front of the parser stage to sequentially load polygon files into main memory ahead of parsing. The use of more advanced storage devices, such as SSDs and disk arrays, can also reduce disk I/O time significantly. Thus, in the following experiments, we assume that the polygon data is already loaded into main memory or imported into the database before the queries or the pipeline are executed.

5.2 Performance of the PixelBox Algorithm

In this subsection, we evaluate the performance of PixelBox and verify some design decisions discussed earlier. The experiments are carried out on the T1500 workstation. Since PostGIS uses GEOS as its geometric computation library, we use the performance of GEOS on a single core as the baseline in the respective experiments. Optimizations similar to the query in Figure 1(b) are used in the baseline to avoid the heavy function call for polygon unions. We select a representative data set, called oligoastroIII_1, for the experiments. It contains 462,016 polygons in one polygon set and 458,878 polygons in the other. In total, 619,609 pairs of polygons whose MBRs intersect are filtered.

Figure 7: Performance comparison of GEOS and PixelBox (execution time in seconds and speedup, both on logarithmic scales).

We first show the overall performance of GEOS and PixelBox in computing the areas of intersection and union for all 619,609 polygon pairs in Figure 7. Both absolute execution times and relative speedups are shown on logarithmic scales. It takes over 430 seconds using GEOS to compute the areas of intersection and union for all polygon pairs. Compared with GEOS, PixelBox is over two orders of magnitude faster, finishing all computations within only 3.6 seconds.

Figure 8: Performance of two algorithm decisions: using sampling boxes and computing areas of union indirectly.

In order to validate several algorithm design decisions, i.e., using sampling boxes to reduce compute intensity and computing the area of union indirectly, we perform a stress test with PixelBox using a set of 15,724 polygon pairs filtered from two representative polygon files in oligoastroIII_1. We increase the polygon sizes by multiplying the coordinates of the polygon vertices by a scale factor whose value varies from 1 to 5. The data sets used in this paper are extracted from slide images captured under a 20x objective lens. Considering that the resolution of commonly used objective lenses is around 40x at the maximum (which increases the sizes of polygons by 4 times), scaling up the coordinates of polygons by a maximum factor of 5 (which increases the sizes of polygons by 25 times) is more than sufficient.

We compare the performance of PixelBox with two base versions: one that uses only the pixelization method (called PixelOnly), and the other that combines the pixelization and sampling-box techniques but computes both the area of intersection and the area of union directly (called PixelBox-NoSep). We tune the grid size, block size, and T (for PixelBox-NoSep and PixelBox) so that all algorithms execute at their best performance. Their execution times are shown in Figure 8.

At all scale factors, the performance of PixelBox-NoSep is consistently higher than that of PixelOnly thanks to the use of sampling boxes, while PixelBox beats the performance of PixelBox-NoSep by further reducing the number of sampling box partitionings performed. When the scale factor is 1, the overhead of per-pixel examination is relatively low because the sizes of the polygons are small. But PixelBox-NoSep and PixelBox still outperform PixelOnly in this case, reducing execution time by 28% and 34%, respectively. As the scale factor increases, the performance of PixelOnly drops rapidly due to the dramatic increase in the number of pixels that must be handled by the algorithm. However, the performance of PixelBox-NoSep and PixelBox only degrades slightly. As the scale factor reaches 5, that is, when the sizes of polygons are increased by 25 times, PixelBox-NoSep improves over PixelOnly by reducing execution time by over 50%, while PixelBox shortens the execution time even further, by 73% compared with PixelBox-NoSep. This experiment verifies the effectiveness of using sampling boxes to reduce compute intensity. It also shows that, by computing areas of union indirectly, the performance of the algorithm can be further enhanced due to the reduced number of sampling box partitions. It has to be noted that the performance of PixelOnly, PixelBox-NoSep and PixelBox is much higher than the GEOS baseline at all scale factors (GEOS takes over 11 seconds).

Figure 9: Performance impact of various optimization techniques in algorithm implementation (speedup over PixelBox-NoOpt at scale factors SF1, SF3, and SF5 for PixelBox-NoOpt, PixelBox-NBC, PixelBox-NBC-UR, and PixelBox-NBC-UR-SM).

    5.3 Effectiveness of Optimization TechniquesOn the T1500 workstation, we evaluate the effectiveness of

    various optimization techniques employed during algorithmimplementation, i.e., using shared memory (for polygon ver-tices data), avoiding bank conflicts, and loop unrolling. Wetake the same set of 15724 polygon pairs used in the pre-vious experiment, with the scaling factors being 1, 3, and5, and measure the execution times of four variants of thePixelBox algorithm: PixelBox-NoOpt denotes the base ver-sion in which none of the optimization techniques are used;PixelBox-NBC denotes the version when bank conflicts areavoided; PixelBox-NBC-UR denotes the version when bankconflicts are avoided and loop unrolling is performed; finally,PixelBox-NBC-UR-SM denotes the version when all opti-mizations are utilized. In all variants, the sampling boxstack is always allocated in shared memory, because other-wise a global heap whose size is proportional to the totalnumber of threads in the whole grid has to be allocated,which we consider an unreasonable design scheme.

The performance of each variant, normalized to PixelBox-NoOpt, is shown in Figure 9. It can be seen that the optimization techniques discussed above are effective in improving the performance of PixelBox. When the scale factor is 1, the performance is improved by a factor of 1.14 after all optimization techniques are applied; when the scale factor is 5, the speedup rises to a factor of 1.30. The contributions of the different optimization techniques to performance vary. The effects of loop unrolling and of using shared memory are more significant than that of avoiding bank conflicts. This is because PixelBox spends more time computing the positions of pixels and sampling boxes than generating new sampling boxes; thus, loop unrolling and using shared memory, which improve the efficiency of computing positions, play a larger role in the performance of PixelBox.
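To make these optimizations concrete, the CUDA sketch below shows one way a per-pixel point-in-polygon test can stage the polygon's vertices in shared memory (the -SM optimization) and unroll its inner loop (the -UR optimization). The kernel, its names, and the ray-casting test are illustrative assumptions and are much simpler than the actual PixelBox kernel; bank-conflict avoidance would additionally pad or reorder the shared-memory layout.

    // Illustrative sketch, not the PixelBox implementation: all identifiers
    // here are hypothetical.
    __global__ void pixel_in_polygon(const float2 *verts, int nverts,
                                     const int2 *pixels, int npixels,
                                     int *inside_flags) {
        extern __shared__ float2 sverts[];          // polygon vertices cached on chip

        // Cooperative load of vertex data into shared memory.
        for (int i = threadIdx.x; i < nverts; i += blockDim.x)
            sverts[i] = verts[i];
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= npixels) return;
        float px = pixels[idx].x + 0.5f, py = pixels[idx].y + 0.5f;

        int crossings = 0;
        #pragma unroll 4                            // loop unrolling (the -UR variant)
        for (int i = 0, j = nverts - 1; i < nverts; j = i++) {
            float2 a = sverts[j], b = sverts[i];    // reads served from shared memory
            if (((b.y > py) != (a.y > py)) &&
                (px < (a.x - b.x) * (py - b.y) / (a.y - b.y) + b.x))
                crossings ^= 1;                     // even-odd (ray casting) rule
        }
        inside_flags[idx] = crossings;              // 1 if the pixel center is inside
    }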

5.4 Parameter Sensitivity of PixelBox

In order to test the sensitivity of the algorithm's performance to the pixelization threshold T, we take the same set of 15724 polygon pairs used above and measure how the execution time of PixelBox varies as we change the value of T. We run the experiments on the T1500 workstation. We set the thread block size to 64, and the performance trend at each scale factor (SF1 to SF5) is shown in Figure 10. The result verifies our analysis for choosing the value of T. The performance of PixelBox is sub-optimal when T is either too small or too large. At all scale factors, it performs best when the value of T lies between 512 and 4096, which corresponds to the range from n^2/8 to n^2.

Figure 10: The sensitivity to the pixelization threshold T.

We also repeated the experiment with the thread block size set to other values, and the trend was similar. However, when the block size is too large (e.g., >= 256), the overall performance of PixelBox degrades. This is because fewer thread blocks can run concurrently on a multiprocessor, and the sampling box partitioning becomes less fine-grained when the block size is too large. In our experience, setting n to a small value and the pixelization threshold to around n^2/2 achieves the highest performance.
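Under our reading of the algorithm, T bounds the size of a sampling box below which per-pixel testing is cheaper than further partitioning. The sequential C++ sketch below illustrates that decision rule; pixelize, fully_inside, fully_outside, and split_into_quadrants are hypothetical helpers that are only declared here, and the real PixelBox performs this logic iteratively per thread block with a shared-memory stack rather than by recursion.

    #include <cstdint>

    // Illustrative sketch of the role of the pixelization threshold T.
    struct Box { int x0, y0, x1, y1; };

    int64_t pixelize(const Box &b);               // per-pixel tests inside b
    bool fully_inside(const Box &b);              // b lies entirely in the intersection
    bool fully_outside(const Box &b);             // b lies entirely outside it
    void split_into_quadrants(const Box &b, Box q[4]);

    int64_t intersection_area(const Box &b, int64_t T) {
        int64_t pixels = int64_t(b.x1 - b.x0) * (b.y1 - b.y0);
        if (fully_inside(b))  return pixels;      // count the whole box at once
        if (fully_outside(b)) return 0;           // skip the box entirely
        if (pixels <= T)      return pixelize(b); // small enough: test pixel by pixel
        Box q[4];
        split_into_quadrants(b, q);               // otherwise partition and recurse
        return intersection_area(q[0], T) + intersection_area(q[1], T)
             + intersection_area(q[2], T) + intersection_area(q[3], T);
    }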

5.5 Performance of the Pipelined Framework

We evaluate the performance of the pipelined framework in this subsection. Task migration is disabled to remove its influence on the pipeline's performance. On the T1500 workstation, we collect the execution times of four schemes that cross-compare the oligoastroIII 1 data set:

• PostGIS-S executes the optimized query shown in Figure 1(b) with PostGIS on a single core;

• NoPipe-S uses a single execution stream that executes a non-pipelined version of the framework in Figure 6, in which the four stages execute sequentially on each pair of input polygon files without pipelining;

• NoPipe-M represents the thread-parallel execution scheme where multiple execution streams (4 streams on a four-core CPU) are launched, each invoking NoPipe-S independently;

• Pipelined is the fully pipelined scheme used in SCCG.

The results are shown in Figure 11, with speedup numbers normalized against the PostGIS-S baseline. The speed of NoPipe-S is determined by the cumulative execution times of all pipeline stages. Since the bottleneck stage of the pipeline has been accelerated by PixelBox on GPUs, NoPipe-S achieves over a 37-fold speedup compared with PostGIS-S.

Figure 11: Performance comparisons between different execution schemes (speedup over the PostGIS-S baseline for PostGIS-S, NoPipe-S, NoPipe-M, and Pipelined).

Figure 12: Performance benefits of dynamic task migration (throughput of SCCG-Mig normalized to SCCG-NoMig under platform configurations Config-I, Config-II, and Config-III).

NoPipe-M improves over NoPipe-S and achieves a 63-fold speedup over PostGIS-S, because simultaneously issuing multiple streams improves the utilization of resources. However, due to the stream serialization caused by uncoordinated concurrent use of the GPU in the last stage, the CPU resources cannot be well utilized. We measured the CPU utilization during the execution of NoPipe-M and observed that all CPU cores were only about 50% saturated at all times, which confirms this analysis. The Pipelined scheme achieves the best performance, accelerating cross-comparison by a factor of 76 compared with PostGIS-S. The result justifies the use of the pipelined framework and shows the importance of coordination when using GPUs.
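For illustration only, the host-side CUDA C++ sketch below shows the flavor of such a pipelined execution with two of the stages: a CPU parser thread feeds a bounded queue that a single GPU-facing aggregator thread drains, so parsing of the next file pair overlaps with GPU work on the previous one, and all GPU work is funneled through one consumer rather than being issued from competing streams. The types and stage bodies are placeholders, not the actual SCCG framework (which has four stages).

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Task { int file_pair_id; };            // placeholder task description

    class BoundedQueue {                          // data batching between stages
        std::queue<Task> q_; std::mutex m_; std::condition_variable cv_;
        size_t cap_; bool closed_ = false;
    public:
        explicit BoundedQueue(size_t cap) : cap_(cap) {}
        void push(Task t) {
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [&]{ return q_.size() < cap_; });
            q_.push(t); cv_.notify_all();
        }
        bool pop(Task &t) {
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [&]{ return !q_.empty() || closed_; });
            if (q_.empty()) return false;
            t = q_.front(); q_.pop(); cv_.notify_all(); return true;
        }
        void close() { std::lock_guard<std::mutex> l(m_); closed_ = true; cv_.notify_all(); }
    };

    void parse_polygons(Task &) { /* CPU-side parsing/indexing (placeholder) */ }
    void run_pixelbox_on_gpu(const Task &) { /* launches the GPU kernels (placeholder) */ }

    int main() {
        BoundedQueue to_gpu(16);
        std::thread parser([&]{                   // CPU stage: producer
            for (int i = 0; i < 100; ++i) { Task t{i}; parse_polygons(t); to_gpu.push(t); }
            to_gpu.close();
        });
        std::thread aggregator([&]{               // the only consumer of the GPU
            Task t;
            while (to_gpu.pop(t)) run_pixelbox_on_gpu(t);
        });
        parser.join(); aggregator.join();
        return 0;
    }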

5.6 Effectiveness of Dynamic Task Migration

In order to verify the design of the task migration component, we perform experiments in three different platform configurations: (I) on the T1500 workstation, we evaluate the effectiveness of the task migration component in offloading workloads from CPUs to GPUs; (II) on Amazon EC2 with both GPU cards used, we evaluate its effectiveness in offloading workloads from CPUs to GPUs; and (III) on Amazon EC2 with only one GPU card used, we evaluate its effectiveness in offloading workloads from GPUs to CPUs. Since the GPUs on both platforms are very powerful, in order to make GPU-to-CPU workload offloading occur, we purposely slow down PixelBox by selecting a sub-optimal thread block size in Configuration III. In a real system environment, the speed of the aggregator stage may be lower due to concurrent sharing of GPUs with other applications, which is the case we want to emulate in Configuration III.

The oligoastroIII 1 data set is used in the experiments. We show the throughput of task-migration-enabled SCCG (SCCG-Mig) normalized to the throughput of task-migration-disabled SCCG (SCCG-NoMig) in each configuration. Throughput is defined as the size of the data set divided by the execution time.

As Figure 12 shows, on the T1500 workstation (Config-I), the throughput of SCCG with dynamic task migration is about 50% higher than that of SCCG without it. In this setting, the aggregator stage cannot keep the GPU fully occupied, which triggers the migrator to dynamically offload tasks from the parser stage to execute on the GPU. This improves the performance of the parser stage and thus enhances the overall throughput. On Amazon EC2 with both GPUs utilized (Config-II), the GPU resource still cannot be fully utilized by the aggregator stage. Thus, workloads are migrated from CPUs to GPUs, and the throughput of the pipeline is improved by over 40%.

Figure 13: The overall performance of SCCG compared with PostGIS-M on all 18 data sets: (a) absolute execution times in seconds per data set; (b) relative speedups per data set.

The throughput improvement is lower than in Config-I because the CPUs are more powerful, which causes less workload to be offloaded to the GPUs. On Amazon EC2 with only one GPU utilized (Config-III), dynamic task migration improves the pipeline throughput by over 14%. In this scenario, the aggregator stage becomes the bottleneck of the pipeline because it cannot consume aggregator tasks as fast as the other stages executing on the CPUs generate them. Aggregator tasks are therefore migrated to execute on the CPUs. However, due to the relatively small speed gap between the parser and aggregator stages and the limited performance of PixelBox-CPU on CPUs, the throughput improvement is smaller than in the other configurations.
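A minimal sketch of a migration policy consistent with the behaviour described above is given below; the thresholds, names, and queue interface are illustrative assumptions, not the actual migrator.

    // Illustrative decision rule: offload parser work to an idle GPU, or let
    // CPU cores take aggregator tasks when the GPU stage falls behind.
    #include <cstddef>

    enum class Move { ParserTasksToGpu, AggregatorTasksToCpu, None };

    Move decide_migration(std::size_t aggregator_queue_len,
                          std::size_t aggregator_queue_cap,
                          bool gpu_idle_detected) {
        // GPU starved: the aggregator (GPU) stage is waiting for input, so shift
        // some parser work onto the GPU to keep it busy (Config-I and Config-II).
        if (gpu_idle_detected && aggregator_queue_len == 0)
            return Move::ParserTasksToGpu;
        // GPU overloaded: aggregator tasks pile up faster than the GPU can drain
        // them, so let CPU cores (PixelBox-CPU) take some of them (Config-III).
        if (aggregator_queue_len > aggregator_queue_cap * 3 / 4)
            return Move::AggregatorTasksToCpu;
        return Move::None;
    }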

5.7 Performance Evaluation with All Data Sets

In this section, we give the complete performance results of SCCG compared with a parallelized PostGIS solution over all 18 data sets. The experiments with SCCG are performed on the T1500 workstation with only one GPU card and a 4-core CPU. The experiments with PostGIS are performed on the Amazon EC2 instance with both 4-core CPUs fully utilized. Query executions in PostGIS are parallelized over all CPU cores by evenly partitioning the polygon tables into 16 chunks and launching 16 query streams to process different chunks concurrently. We refer to this execution scheme as PostGIS-M. To be generous to PostGIS, we only count index building and query execution times; the time spent on partitioning the polygon tables is not included. We measure the times taken by SCCG and PostGIS-M to cross-compare each data set, and present both the absolute execution times and the relative speedups.

As Figure 13(a) shows, it takes PostGIS-M over 1120 seconds to process all data sets, while SCCG finishes all computations within only 64 seconds. The varied execution times of PostGIS-M and SCCG on different data sets are due to the different numbers and sizes of polygons in each data set. For example, the first data set contains only 20 polygon files (each segmentation algorithm generated 10 files) and about 57000 polygons, while the last data set comprises a total of 442 polygon files with over 4 million polygons. Figure 13(b) shows the relative speedups. Among all data sets, SCCG achieves a minimum of a 13-fold speedup and a maximum of over a 44-fold speedup compared with PostGIS-M. The last column of Figure 13(b) gives the geometric mean of the speedups across all data sets, which is over 18 times.
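For reference, a minimal sketch of how such a geometric mean is computed from the per-data-set speedups (illustrative code, not the script used for the measurements):

    #include <cmath>
    #include <vector>

    // Geometric mean of per-data-set speedups: the exponential of the mean of
    // the logs, equivalent to the n-th root of their product.
    double geometric_mean(const std::vector<double> &speedups) {
        double log_sum = 0.0;
        for (double s : speedups) log_sum += std::log(s);
        return std::exp(log_sum / speedups.size());
    }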

These results show the effectiveness of our SCCG solution in improving the performance of spatial cross-comparison at low cost. Two Intel X5570 CPUs cost over $2000 at the current market price, while the total cost of an Intel Core i7 860 CPU and a GTX580 GPU is only about $820.

6. RELATED WORK

Though modern computer architectures provide rich parallel resources, the existing geometric algorithms for spatial operations implemented in widely used libraries (e.g., CGAL and GEOS) and in major SDBMSs are still single-threaded. There have been several attempts at parallel algorithms. A parallel algorithm was proposed in [12] to compute the areas of intersection and union on CPUs; however, it was not designed to execute in SIMD fashion, which has been the key to achieving high performance on both CPUs and GPUs in the era of high-throughput computing [25]. As a numerical approximation method, Monte Carlo [11] can be used to compute the areas of intersection and union on GPUs, by repeatedly generating randomized sampling points and counting the number of points lying within the region. However, the repeated casting of random sampling points makes Monte Carlo much more compute-intensive than our optimized PixelBox algorithm. Another work [31] proposed to test polygon intersections by drawing polygons into a frame buffer through the OpenGL interface and counting the number of pixels with specific colors. This method could be extended to compute the areas of intersection and union, but it would suffer a performance problem similar to that of the pure pixelization approach due to its high compute intensity.
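To make the comparison with the Monte Carlo approach concrete, a minimal sketch follows; the estimator, its names, and the bounding-box interface are our own illustration. The point is that every random sample requires two point-in-polygon tests, which is where the extra compute intensity comes from.

    #include <random>
    #include <vector>

    struct Pt { double x, y; };

    // Standard ray-casting point-in-polygon test (even-odd rule).
    bool point_in_polygon(const Pt &p, const std::vector<Pt> &poly) {
        bool in = false;
        for (size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
            if (((poly[i].y > p.y) != (poly[j].y > p.y)) &&
                (p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) /
                           (poly[j].y - poly[i].y) + poly[i].x))
                in = !in;
        }
        return in;
    }

    // Monte Carlo estimate of the intersection area over the bounding box
    // [x0,x1] x [y0,y1]; accuracy improves only slowly with the sample count n.
    double mc_intersection_area(const std::vector<Pt> &a, const std::vector<Pt> &b,
                                double x0, double y0, double x1, double y1, int n) {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> ux(x0, x1), uy(y0, y1);
        int hits = 0;
        for (int i = 0; i < n; ++i) {
            Pt p{ux(rng), uy(rng)};
            if (point_in_polygon(p, a) && point_in_polygon(p, b)) ++hits;
        }
        return (x1 - x0) * (y1 - y0) * double(hits) / n;   // box area times hit fraction
    }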

Prior work has proposed optimized algorithms and implementations for various database operations on GPU architectures, including join [16], selection and aggregation [14], sorting [13], tree search [22], list intersection and index compression [6], and transaction execution [17]. Moreover, using a CPU-GPU hybrid environment to accelerate foreign-key joins has been explored in [29]. Compared with these works, we focus on optimizing spatial operations for image comparisons in a CPU/GPU hybrid environment. In addition, with respect to our system execution framework, related work on exploiting pipelined execution parallelism can be found in parallel database systems [28] and in optimized data-sharing query execution engines [15].

7. CONCLUSIONS

We have presented our solution for the fast cross-comparison of pathology imaging data in a CPU/GPU hybrid environment. After a thorough profiling of a spatial database solution, we identified computing the areas of intersection and union over polygon sets as the performance bottleneck. Our PixelBox algorithm and its GPU implementation fundamentally remove this bottleneck. Moreover, our pipelined structure with dynamic task migration can efficiently execute the whole workload on CPUs and GPUs. Our solution has been verified through extensive experiments: it achieves more than an 18x speedup over parallelized PostGIS when processing real-world pathology data.

8. REFERENCES

[1] postgis.refractions.net.
[2] www.scidb.org.
[3] trac.osgeo.org/geos/.
[4] www.cgal.org.
[5] developer.nvidia.com/category/zone/cuda-zone.
[6] N. Ao, F. Zhang, D. Wu, D. S. Stones, G. Wang, X. Liu, J. Liu, and S. Lin. Efficient parallel lists intersection and index compression algorithms using graphics processing units. PVLDB, 4(8), 2011.
[7] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A View of the Parallel Computing Landscape. Commun. ACM, 52:56-67, Oct. 2009.
[8] M. J. Berger and J. Oliger. Adaptive mesh refinement for hyperbolic partial differential equations. Journal of Computational Physics, 53(3):484-512, 1984.
[9] L. A. D. Cooper, J. Kong, D. A. Gutman, F. Wang, J. Gao, C. Appin, S. Cholleti, T. Pan, A. Sharma, L. Scarpace, T. Mikkelsen, T. Kurc, C. S. Moreno, D. J. Brat, and J. H. Saltz. Integrated morphologic analysis for the identification and characterization of disease subtypes. Journal of the American Medical Informatics Association, Jan. 2012.
[10] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. 1997.
[11] G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. 1996.
[12] W. R. Franklin. Calculating Map Overlay Polygons' Areas without Explicitly Calculating the Polygons - Implementation. In SDH, 1990.
[13] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In SIGMOD, 2006.
[14] N. K. Govindaraju, B. Lloyd, W. Wang, M. Lin, and D. Manocha. Fast Computation of Database Operations using Graphics Processors. In SIGMOD, 2004.
[15] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A Simultaneously Pipelined Relational Query Engine. In SIGMOD, 2005.
[16] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational Joins on Graphics Processors. In SIGMOD, 2008.
[17] B. He and J. X. Yu. High-throughput transaction executions on graphics processors. PVLDB, 4(5), 2011.
[18] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. 5th edition, 2011.
[19] I. Kamel and C. Faloutsos. Hilbert R-tree: An improved R-tree using fractals. In VLDB, 1994.
[20] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments. In USENIX ATC, 2011.
[21] M. L. Kersten, S. Idreos, S. Manegold, and E. Liarou. The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds. In VLDB, 2011.
[22] C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey. FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs. In SIGMOD, 2010.
[23] D. B. Kirk and W.-m. W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. 1st edition, 2010.
[24] A. Kukanov and M. J. Voss. The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks. Intel Technology Journal, 11:309-322, Nov. 2007.
[25] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. In ISCA, 2010.
[26] S. Loebman, D. Nunley, Y. Kwon, B. Howe, M. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In CLUSTER, pages 1-10, 2009.
[27] J. O'Rourke. Computational Geometry in C. 1998.
[28] J. Patel, J. Yu, N. Kabra, K. Tufte, B. Nag, J. Burger, N. Hall, K. Ramasamy, R. Lueder, C. Ellmann, J. Kupsch, S. Guo, J. Larson, D. DeWitt, and J. Naughton. Building a Scaleable Geo-Spatial DBMS: Technology, Implementation, and Evaluation. In SIGMOD, 1997.
[29] H. Pirk, S. Manegold, and M. L. Kersten. Accelerating Foreign-Key Joins Using Asymmetric Memory Channels. In VLDB Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, 2011.
[30] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA. In PPoPP, 2008.
[31] C. Sun, D. Agrawal, and A. El Abbadi. Hardware Acceleration for Spatial Selections and Joins. In SIGMOD, 2003.
[32] L. Surhone, M. Tennoe, and S. Henssonow. Rectilinear Polygon. 2010.
[33] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. 1st edition, 2005.
[34] F. Wang, J. Kong, L. Cooper, T. Pan, T. Kurc, W. Chen, A. Sharma, C. Niedermayr, T. Oh, D. Brat, A. Farris, D. Foran, and J. Saltz. A data model and database for high-resolution pathology analytical image informatics. Journal of Pathology Informatics, 2(1):32, 2011.
[35] Y. Zhang and J. D. Owens. A Quantitative Performance Analysis Model for GPU Architectures. In HPCA, 2011.