
GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing

Guohao Dai, Student Member, IEEE, Tianhao Huang, Yuze Chi, Jishen Zhao, Member, IEEE,

Guangyu Sun, Member, IEEE, Yongpan Liu, Senior Member, IEEE, Yu Wang, Senior Member, IEEE, Yuan Xie, Fellow, IEEE, and Huazhong Yang, Senior Member, IEEE

Abstract—Large-scale graph processing requires high bandwidth of data access. However, as graph computing continues to scale, it becomes increasingly challenging to achieve high bandwidth on generic computing architectures. The primary reasons include: the random access pattern causing local bandwidth degradation, the poor locality leading to unpredictable global data access, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units. Processing-in-memory (PIM) has been explored as a promising solution to providing high bandwidth, yet open questions of graph processing on PIM devices remain in: 1) how to design hardware specializations and the interconnection scheme to fully utilize the bandwidth of PIM devices and ensure locality and 2) how to allocate data and schedule the processing flow to avoid conflicts and balance workloads. In this paper, we propose GraphH, a PIM architecture for graph processing on the hybrid memory cube array, to tackle all four problems mentioned above. From the architecture perspective, we integrate SRAM-based on-chip vertex buffers to eliminate local bandwidth degradation. We also introduce a reconfigurable double-mesh connection to provide high global bandwidth. From the algorithm perspective, partitioning and scheduling methods such as index mapping interval-block and round interval pair are introduced to GraphH, so that workloads are balanced and conflicts are avoided. Two optimization methods are further introduced to reduce synchronization overhead and reuse on-chip data. The experimental results on graphs with billions of edges demonstrate that GraphH outperforms DDR-based graph processing systems by up to two orders of magnitude and achieves a 5.12× speedup against the previous PIM design.

Manuscript received November 9, 2017; revised January 25, 2018; accepted March 7, 2018. Date of publication March 30, 2018; date of current version March 19, 2019. This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0207600, in part by the National Natural Science Foundation of China under Grant 61622403 and Grant 61621091, in part by the Joint Fund of Equipment Pre-Research and Ministry of Education under Grant 6141A02022608, and in part by the Beijing National Research Center for Information Science and Technology. The work of G. Dai was supported by the China Scholarship Council. This paper was recommended by Associate Editor X. Li. (Corresponding author: Yu Wang.)

G. Dai, T. Huang, Y. Liu, Y. Wang, and H. Yang are with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China, and also with the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]).

Y. Chi is with the Computer Science Department, University of California at Los Angeles, Los Angeles, CA 90095 USA.

J. Zhao is with the Computer Science and Engineering Department, Jacobs School of Engineering, University of California at San Diego, La Jolla, CA 92093 USA.

G. Sun is with the Center for Energy-Efficient Computing and Applications, School of EECS, Peking University, Beijing 100871, China.

Y. Xie is with the Department of Electrical and Computer Engineering, University of California at Santa Barbara, Santa Barbara, CA 93106 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCAD.2018.2821565

Index Terms—Hybrid memory cube (HMC), large-scale graph processing, memory hierarchy, on-chip networks.

I. INTRODUCTION

AS WE are now in the “big data” era, the data volume collected from various digital devices has skyrocketed in recent years. Meanwhile, the ever-growing demand for analysis over these data brings tremendous challenges to both analytical models and computing architectures. A graph, as a conventional data structure, can both store data values and represent the relationships among data, and the large-scale graph processing problem is gaining increasing attention in various domains. Many systems have been proposed and have achieved significant performance improvement on large-scale graph processing problems, including Tesseract [1], Graphicionado [2], GraphPIM [3], Gemini [4], GraphLab [5], etc. [6]–[9].

The essential way to improve the performance of large-scale graph processing is to provide higher bandwidth for data access. However, achieving high bandwidth for graph processing on conventional architectures suffers from four challenges.

1) Random Access: The data access pattern in major graph algorithms is highly irregular, which leads to considerable local bandwidth degradation (e.g., >90% bandwidth degradation on the CPU-DRAM hierarchy [10]) on conventional computing architectures.

2) Poor Locality: A simple operation on one vertex may require access to all its neighbor vertices. Such poor locality leads to irregular global data access and inefficient bandwidth utilization (e.g., 57% bandwidth degradation on two-hop access [11]) on architectures with multiple processing units.

3) Unbalanced Workloads: Graphs are highly unstructured; thus, the computation workloads of different processing units can be quite unbalanced.

4) Heavy Conflicts: Simultaneous updates to the same vertex by different processing units cause heavy conflicts.

All four challenges need to be addressed from both the architecture and the algorithm perspectives.

Processing-in-memory (PIM) has been put forward as a promising solution to providing high bandwidth for big data problems. PIM achieves speedups of several orders of magnitude on many applications [1], [3], [6], [12], [13]. Various processing units can be placed into the memory and attached to a part of the memory in PIM. The total theoretical bandwidth of all units under a PIM architecture can be 10×–100× larger than that under conventional computing architectures [1]. Moreover, PIM scales well to large-scale problems because it provides bandwidth proportional to the memory capacity.


Fig. 1. Challenges in large-scale graph processing and solutions in GraphH. Our system makes use of the advantages of PIM and overcomes the disadvantages of PIM, providing solutions to each challenge in graph processing.


Although PIM provides advantages such as high bandwidth and massive processing units (Fig. 1), it still suffers from inherent disadvantages in graph processing. For example, PIM is sensitive to unbalanced workloads and conflicts, like other multicore architectures. Moreover, achieving high bandwidth in large-scale graph processing is hindered by both the random access pattern and poor locality, and these problems remain in PIM. To tackle these problems, previous works have proposed several solutions. Tesseract [1] introduced a prefetcher to exploit locality, but such a prefetching strategy cannot avoid global random access to graph data. GraphPIM [3] proposed an instruction off-loading scheme for hybrid memory cube (HMC)-based graph processing, but does not present how to make use of multiple HMCs.

Therefore, using multiple PIM devices (e.g., an HMC array) can be a scalable solution for large-scale graph processing. Although Tesseract [1] has adopted the HMC array structure for graph processing, open questions remain in how to fully utilize the high bandwidth provided by PIM devices. In this paper, we propose GraphH, an HMC array architecture for large-scale graph processing problems. Compared with Tesseract [1], both the hardware architecture and the partitioning/scheduling algorithms are elaborately designed in GraphH. From the architecture perspective, we design specialized hardware units in the logic layer of the HMC, including SRAM-based on-chip vertex buffers (OVBs), to fully exploit the local bandwidth of a vault. We also propose the reconfigurable double-mesh connection (RDMC) scheme to ensure locality and provide high global access bandwidth in graph processing. From the algorithm perspective, we propose the index mapping interval-block (IMIB) partitioning method to balance the workloads of different processing units. A scheduling method, round interval pair (RIP), is also introduced to avoid write conflicts. All four designs, which are not introduced in Tesseract [1], lead to performance improvements (detailed in Section VI-C: 4.58× using OVB, 1.29× using RDMC+RIP, and 3.05× using IMIB, on average). The contributions of this paper are summarized as follows.

1) We integrate OVBs in GraphH. Processing units can directly access the OVB and do not suffer from the random access pattern.

2) We connect cubes in GraphH using the RDMC scheme to provide high global bandwidth and ensure locality.

3) We partition graphs using the IMIB method to balance the workloads of cubes.

4) We schedule the data transfer among cubes using the RIP scheme. Communication among cubes is organized in pairs and conflicts are avoided.

5) We propose two optimization methods to reduce synchronization overhead and reuse on-chip data, which further improve the performance of GraphH.

We have also conducted extensive experiments to evaluate the performance of the GraphH system. We choose both real-world and synthetic large-scale graphs as our benchmarks and test three graph algorithms over these graphs. Our evaluation results show that GraphH remarkably outperforms DDR-based graph processing systems by up to two orders of magnitude and achieves up to 5.12× speedup compared with Tesseract [1].

The rest of this paper is organized as follows. Section II introduces the background information of graph processing models and HMCs. Section III proposes the architecture of GraphH. Then, the processing flow in GraphH is detailed in Section IV. We further propose two optimization methods to improve the performance of GraphH in Section V. Results of comprehensive experiments are shown in Section VI. Related works are introduced in Section VII and we conclude this paper in Section VIII.

II. PRELIMINARY AND BACKGROUND

In this section, we introduce the background information of both large-scale graph processing and PIM technology. The notations of our graph abstraction used in this and the following sections are shown in Table I.

A. Graph Abstraction

Given a graph G = (V, E), where V and E denote the vertex and edge sets of G, the computation task over G is to update the values of V. We assume that the graph is directed, so that each edge e_{i,j} in G is associated with a source vertex v_i and a destination vertex v_j. An undirected graph can be implemented by adding an opposite edge e_{j,i} for each e_{i,j} in G.
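To make this abstraction concrete, the following minimal sketch (ours, not from the paper; class and field names are illustrative only) stores a directed graph as a vertex-value list plus an edge list of (source, destination) index pairs, and symmetrizes an undirected input by adding the opposite edges as described above.

# Minimal sketch of the graph abstraction: vertex values plus directed edges.
class Graph:
    def __init__(self, num_vertices, edges, directed=True):
        # value[i] holds value(v_i); edges holds pairs (i, j) for e_{i,j}.
        self.value = [0] * num_vertices
        self.edges = list(edges)
        if not directed:
            # An undirected graph adds the opposite edge e_{j,i} for each e_{i,j}.
            self.edges += [(j, i) for (i, j) in edges]

# A toy undirected graph built by symmetrizing three edges (0-based indices).
g = Graph(num_vertices=4, edges=[(0, 1), (1, 2), (2, 3)], directed=False)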

To build a general graph processing framework, most large-scale graph processing systems, including GraphLab [5] and Pregel [7], adopt a high-level graph algorithm abstraction called the vertex-centric model (VCM). VCM is a commonly used model that applies to different graph algorithms. In VCM, graph processing is divided into iterations, and each vertex needs to communicate with all its neighbor vertices to perform the updating operation. The edge-centric model (ECM) [14] describes VCM from another perspective.


TABLE I
NOTATIONS OF A GRAPH

Algorithm 1 Pseudocode of the Edge-Centric Model [14]
Input: G = (V, E), initialization condition
Output: Updated V

1: for each v ∈ V do
2:     Initialize(v, initialization condition)
3: end for
4: while (not finished) do
5:     for each e ∈ E do
6:         value(e_dst) = Update(e_src, e_dst)
7:     end for
8: end while
9: return V

Algorithm 1 shows a detailed example of graph processing under the ECM. In ECM, the values of all vertices are initialized at the beginning of the program (line 2 in Algorithm 1). Then, the algorithm is executed in iterations, and each edge is traversed in an iteration (lines 4–8 in Algorithm 1). When an edge is accessed, the destination vertex is updated using the value of the source vertex (line 6 in Algorithm 1). ECM is a high-level abstraction model for graph processing; different algorithms only differ in the Initialize() and Update() functions. For example, in breadth-first search (BFS), the value of every vertex is set to infinity while the value of the root vertex is set to zero by Initialize(). In Update(), a destination vertex is updated if the depth of the source vertex is smaller than its own depth.
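As an illustration (our own code, not the paper's), the ECM of Algorithm 1 can be instantiated for BFS by letting the vertex value be its depth from the root; the initialize/update functions below mirror the description above, and their names and signatures are our own.

import math

def initialize(values, root):
    # BFS initialization: all vertices at infinite depth, the root at zero.
    for v in range(len(values)):
        values[v] = math.inf
    values[root] = 0

def update(values, src, dst):
    # Relax the edge: the destination takes a smaller depth offered by the source.
    if values[src] + 1 < values[dst]:
        values[dst] = values[src] + 1
        return True
    return False

def edge_centric_bfs(num_vertices, edges, root):
    values = [None] * num_vertices
    initialize(values, root)
    changed = True
    while changed:                     # one iteration = one pass over all edges
        changed = False
        for (src, dst) in edges:
            changed |= update(values, src, dst)
    return values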

B. Graph Partitioning

Graph partitioning is commonly used to ensure the locality of graph data access, and many previous works have proposed graph partitioning strategies. Distributed/multicore systems like Gemini [4] and Polymer [11] mainly focus on issues like locality, cache coherence, low-overhead scale-out designs, etc. NXgraph [8] proposed an interval-block-based graph partitioning method. After preprocessing, all vertices in the graph are divided into P disjoint intervals. Edges are divided into P shards according to their destination vertices. Furthermore, edges in a shard are divided into P blocks according to their source vertices.


Fig. 2. Intervals and blocks (vertices and edges are divided into three intervals and 3 × 3 = 9 blocks). (a) Example. (b) Graph partitioning.

Fig. 2 shows an example of this interval-block graph partitioning method. Vertices in the graph are divided into three intervals, three shards, and 3 × 3 = 9 blocks. Each edge is assigned to a block according to its source and destination vertices. For example, interval I_2 contains vertices v_3, v_4, and v_5, and interval I_3 contains vertices v_6, v_7, and v_8; thus, block B_{2,3} contains edges e_{3,8}, e_{4,8}, and e_{5,8}. In this graph partitioning method, when one interval I_x updates another interval I_y using the corresponding block B_{x,y}, other intervals and blocks are not accessed.
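Stated as code (a sketch under the assumption that a vertex-to-interval map is already given; names are ours), an edge e_{i,j} is placed in block B_{x,y} when v_i belongs to I_x and v_j belongs to I_y.

from collections import defaultdict

def build_blocks(edges, vertex_interval):
    # vertex_interval[v] is the interval index of vertex v.
    # Edge (i, j) goes to block (vertex_interval[i], vertex_interval[j]);
    # shard y is the union of all blocks (*, y).
    blocks = defaultdict(list)
    for (i, j) in edges:
        blocks[(vertex_interval[i], vertex_interval[j])].append((i, j))
    return blocks

With this layout, updating I_y from I_x touches only block B_{x,y}, which is exactly the locality property used by the scheduling in Section IV.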

C. Hybrid Memory Cube

The PIM architecture has been proposed to provide higher bandwidth from the hardware perspective. By allocating processing units inside memory, PIM achieves memory-capacity-proportional bandwidth and can therefore scale to large-scale problems.

3-D die-stacking memory devices, such as the HMC [15], allow us to implement PIM with high-bandwidth memory. For example, an HMC device in a single package contains multiple memory layers and one logic layer. These layers are stacked together and connected using through-silicon via (TSV) technology to achieve high bandwidth. In an HMC, memory and logic layers are organized into vertical vaults, which can perform computation independently. The latest HMC devices can provide at most 8 GB of memory space and up to 480 GB/s of external bandwidth (four serial links, each with a default of 16 input lanes and 16 output lanes for full-duplex operation), according to the Micron HMC 2.1 specification [16]. Inside an HMC, there are 32 vaults. Each vault consists of a logic layer and several memory layers, and can provide up to 256 MB of memory space and 10 GB/s of bandwidth (16 GB/s in Tesseract [1]). Thus, the maximum aggregate internal bandwidth of an HMC device can be up to 320 GB/s according to the HMC 2.1 specification (512 GB/s in Tesseract). Because of advantages like high internal bandwidth and a low cost of moving data, the HMC has been considered an effective solution for several data-intensive applications [1], [3], [12].


Fig. 3. Detailed architecture of GraphH (from left to right: the cube array and the reconfigurable double-mesh connection controlled by the host processor through the crossbar switch, the HMC architecture, the 3-D-stacking architecture in a vault, and the detailed implementation of the logic layer).

For example, Gao et al. [12] used HMCs for neural computing and achieved a 4.1× improvement compared with conventional architectures.

Although Ahn et al. [1] proposed Tesseract to implement graph processing using PIM devices, they did not fully utilize the high bandwidth of PIM. In Tesseract, only one prefetcher is integrated as a queue for messages, without giving much thought to graph data access patterns. As a result, Tesseract involves unpredictable global access, leading to low bandwidth utilization (e.g., less than 3000 GB/s of bandwidth usage, while 16 HMCs can provide up to 16 GB/s × 32 × 16 = 8192 GB/s, i.e., less than 37% bandwidth utilization).

III. GRAPHH ARCHITECTURE

Based on the background information in Section II, we design GraphH. We use the HMC as an example implementation of GraphH.

A. Overall Architecture

Our GraphH implementation consists of a 16-cube array (the same as Tesseract [1]), shown in Fig. 3. Inside each HMC, there are 32 vaults and a crossbar switch to connect them to the external links. A vault consists of both a logic layer and memory layers to execute graph processing operations. A host processor is responsible for data allocation and interconnection configuration.

1) HMC and Double-Mesh: GraphH physically arranges 16 cubes in a 2-D array structure. To connect the 16 cubes, GraphH uses a double-mesh structure as the interconnection (represented by the solid line and the dotted line in Fig. 3). The connectivity of this double mesh can be dynamically reconfigured at runtime, which is detailed in Section III-C.

2) Vault and Crossbar: The vault is the basic unit of the HMC. An HMC is composed of 32 vaults and provides high bandwidth to each vault. According to the HMC 2.1 specification, the collective internally available bandwidth from all these 32 vaults is made accessible to the I/O links through a crossbar switch.

3) Layer and TSV: Inside each vault, there are several memory layers and a logic layer. These layers are stacked together using TSVs [17]. According to the HMC 2.1 specification, the bandwidth between the memory layers and the logic layer of each vault can be up to 10 GB/s. We implement a simple in-order core, two specific OVBs, the data controller, and the network logic in the logic layer.

B. On-Chip Vertex Buffer

The data access pattern in graph processing is highly randomized, leading to unpredictable latency (e.g., 2×–3× latency difference between global and local access [11]) when accessing different vertices/edges, and to effective bandwidth degradation (e.g., 57% bandwidth degradation on two-hop access [11]). To overcome the unpredictable latency and bandwidth loss, we introduce OVBs as the bridge between the in-order core and the memory.

In the ECM, updates are propagated from the source vertex to the destination vertex, and both source and destination vertices are randomly accessed. Two types of OVB, the source vertex buffer and the destination vertex buffer, are integrated into the logic layer of a vault using SRAM. Instead of directly accessing vertices stored in the DRAM layers during processing, GraphH first loads vertex data into the OVB. The in-order core can then directly access vertex values stored in the OVB in a random pattern without bandwidth degradation. After all vertices in the destination vertex buffer have been updated (i.e., all corresponding in-edges have been accessed by the in-order core), GraphH flushes the data in the destination vertex buffer back to the DRAM layers and starts to process other vertices in the same way.
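The per-vault flow just described can be summarized by the following sketch (our abstraction; the load/store interface to the DRAM layers and the update function are placeholders, not the hardware's actual API).

def process_with_ovb(dram, block_edges, src_interval, dst_interval, update):
    # 1) Sequentially load the needed vertex values from DRAM layers into SRAM.
    src_buf = dram.load(src_interval)      # source vertex buffer
    dst_buf = dram.load(dst_interval)      # destination vertex buffer

    # 2) Random accesses during updating hit only the on-chip buffers.
    for (src, dst) in block_edges:
        dst_buf[dst] = update(src_buf[src], dst_buf[dst])

    # 3) Flush the updated destination values back to DRAM sequentially.
    dram.store(dst_interval, dst_buf)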

By adopting OVBs in the logic layer, the random access pattern to the DRAM layers is avoided: data are sequentially loaded from and written back to the stacked memory. We assume that the logic layer has the same size as a DRAM layer. According to the Micron HMC 2.1 specification [16], each DRAM layer is 8 GB. We use Cacti 6.5 [18] to evaluate the area of both SRAMs and DRAMs for the sake of consistency. The area of an 8-GB DRAM with a 64-bit input/output bus (data + address) under a 32-nm process is 257.02 mm² [1], while the area of a 1-MB SRAM buffer is 1.65 mm² under the same process. By allocating two OVBs (a source vertex buffer and a destination vertex buffer) to each vault, the total area consumption of OVBs is 105.6 mm², which is less than half of the area of the logic layer.


Fig. 4. Examples of reconfiguring the double-mesh connection using the crossbar switch. Lines in orange, yellow, and green represent the configuration of the corresponding cube as a source, destination, and router cube, respectively.

C. Reconfigurable Double-Mesh Connection

According to the Micron HMC 2.1 specification [16], each cube has 8 high-speed serial links providing up to 480 GB/s of aggregate link bandwidth. Many previous researchers have studied the inter-cube connection scheme, and nearly all of these schemes are static and router-based, which means all connections are pre-established. Kim et al. [19] proposed and analyzed such interconnections; the topologies can be mesh, flattened butterfly, dragonfly, etc. Under a static connection scheme, a cube is often connected to 3–4 adjacent cubes, and the 480-GB/s external link bandwidth is partitioned among these cubes, leaving around 120–160 GB/s per connection. However, the aggregate internal bandwidth of a cube is 320 GB/s (10 GB/s per vault), so such a static interconnection scheme cannot fully utilize the internal bandwidth of the HMC when transferring data among cubes.

To avoid the disadvantages of static, router-based connections, GraphH adopts the RDMC scheme shown in Fig. 3. A mesh consists of 24 meta-connections (connections between two adjacent cubes) and 48 joints. Each mesh has an individual data path and owns exclusive bandwidth. Each meta-connection can provide up to 480 GB/s of physical bandwidth with full-duplex operation. The joint on each side of a meta-connection can be configured to connect the external links of a cube or another meta-connection. In this way, two cubes can be connected using an exclusive link with up to 480-GB/s bandwidth. GraphH refers to a look-up table (LUT) in the host processor that stores all configurations required by the processing scheme (detailed in Section IV). In this way, GraphH can build up an exclusive data path between any two cubes in the array. To implement the reconfigurable connection scheme according to the prestored LUT, GraphH uses a crossbar switch [20] connected to four links and four meta-connections (each meta-connection has four individual links for the four serial links of the cube, 4 × 4 = 16 in total). Fig. 4 shows an example of how the crossbar switch works. When a cube in the mesh network is the source/destination of a data transfer, the crossbar switch connects the four output/input serial links of the cube to the corresponding meta-connection (orange and yellow lines in Fig. 4). When a cube is a routing node, the crossbar switch connects meta-connections without transferring data to the cube (green lines in Fig. 4). The status of the crossbar is controlled by the host processor according to a prestored LUT. Moreover, the interconnection reconfiguration can be executed during graph processing inside the cubes (no data are transferred among cubes while a cube is processing its graph partition), so the reconfigurable connection scheme does not affect the overall GraphH performance. Compared with a static connection scheme, this reconfigurable connection scheme avoids the cost of using routers and maximizes the link bandwidth between two connected cubes.

Fig. 5. Size of blocks: dividing consecutive vertices into an interval (left, unbalanced) and dividing vertices using IMIB (right, balanced). Graphs are from Table III.

IV. GRAPHH PROCESSING

Based on the GraphH architecture introduced in Section III, we describe the graph processing flow of GraphH in this section; the high-level programming interface of GraphH is also introduced.

A. Overall Processing Flow

GraphH adopts the ECM and the interval-block partitioning method for graph processing tasks. In the preprocessing and data loading step, the vertex set V is divided into 16 equal-sized intervals and then assigned to the HMC cubes; for each interval, the 16 corresponding in-edge blocks are also loaded into the same cube. In the processing step, program execution is divided into iterations, as in Algorithm 2. Since the ECM iterates over edges and requires both the source and destination vertex data of an edge to be available, each cube (Cube_x) first receives a source interval (I_y) from another cube (Cube_y) in the Transferring Phase. Then Cube_x performs the Updating Phase to update I_x in parallel using the data stored in I_y and B_{y,x}.


Algorithm 2 Pseudocode of Updating I_x in Cube_x
Input: G = (V, E), initialization condition
Output: Updated V

1: for each iteration do
2:     for each Cube_y do
3:         receive I_y from Cube_y                       // Transferring Phase
4:         update I_x using I_y and B_{y,x} in parallel  // Updating Phase
5:     end for
6: end for

To exploit vault-level parallelism, the vertices in an interval are further divided into 32 disjoint small sets. Each vault is responsible for updating one small set with the corresponding in-edges. Thus, a block is also divided into 32 small sets, one stored in each vault. The source interval is duplicated and stored in each vault.

B. Index Mapping Interval-Block Partition

The execution time of each cube is positively correlated with the sizes of both its vertex and edge sets [4]. All cubes need to be synchronized after processing. Thus, it is important to adopt a balanced partitioning method, in which vertices are divided into intervals and assigned to cubes, together with the corresponding edge blocks.

A naive approach is to evenly divide consecutive vertices into intervals. For example, consider a graph with nine vertices, v_1–v_7, v_9, and v_11.¹ Dividing all vertices into three intervals gives I_1 = {v_1–v_3}, I_2 = {v_4–v_6}, and I_3 = {v_7, v_9, v_11}. However, when we apply this method to natural graphs [Fig. 5 (left), graphs are from Table III], the sizes of different blocks are quite unbalanced, which leads to unbalanced workloads. The unbalanced sizes stem from the power-law degree distribution [21] of natural graphs: a small fraction of vertices often covers most of the edges in a graph. Moreover, these vertices often have consecutive indexes and are divided into one interval [e.g., B_{2,1} in AS, B_{1,1} in LJ, B_{5,5} in TW, and B_{1,1} in YH account for 10.86%, 10.73%, 3.71%, and 18.19% of the total edges, respectively, in Fig. 5 (left)].

¹Some vertex indexes do not appear in the original raw edge list of a graph, like v_8 in this example.

To avoid unbalanced workloads among cubes, we adopt the IMIB partitioning method. IMIB consists of two steps (a code sketch follows the list below).

1) Compression: We compress vertex indexes by removing blank vertices. For example, v_1–v_7, v_9, and v_11 are mapped to v_1–v_9, with v_1–v_7 → v_1–v_7, v_9 → v_8, and v_11 → v_9.

2) Hashing: After mapping vertices to compressed indexes, we divide them into different intervals using a modulo function. For example, v_1–v_9 (after compression) are divided into three intervals with I_1 = {v_1, v_4, v_7}, I_2 = {v_2, v_5, v_8}, and I_3 = {v_3, v_6, v_9}.
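A minimal sketch of the two IMIB steps (compression, then modulo hashing) on the nine-vertex example above; the function names and the 1-based convention are ours.

def compress_indexes(raw_vertex_ids):
    # Step 1 (compression): map the vertex ids that actually appear in the
    # edge list to consecutive indexes 1..n, removing blank vertices.
    return {v: k + 1 for k, v in enumerate(sorted(set(raw_vertex_ids)))}

def interval_of(compressed_index, P):
    # Step 2 (hashing): assign compressed index c to interval ((c - 1) mod P) + 1.
    return (compressed_index - 1) % P + 1

# Nine-vertex example: v1..v7, v9, v11 are compressed to v1..v9, then hashed into P = 3 intervals.
mapping = compress_indexes([1, 2, 3, 4, 5, 6, 7, 9, 11])
assert mapping[9] == 8 and mapping[11] == 9
assert [interval_of(c, 3) for c in (1, 4, 7)] == [1, 1, 1]    # I_1 = {v1, v4, v7}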

With IMIB, the sizes of both intervals and blocks are balanced [Fig. 5 (right)]: the ratios between the largest and smallest blocks are 1.32×, 1.11×, 1.22×, and 1.18×, respectively, for the four graphs, which is much smaller than the corresponding ratios under the naive partitioning method, namely infinity (some blocks are empty), 6175×, 996×, and 33 334×. In this way, workloads are balanced in GraphH. Although there are many other partitioning methods, such as METIS [22], we use IMIB to minimize the preprocessing overhead. The time complexity of IMIB (as well as of dividing consecutive vertices into intervals) is O(m), because we only need to scan all edges without extra calculations (e.g., we do not need to obtain the degree of each vertex to perform a partitioning scheme based on the number of edges in a partition, as Gemini [4] does).

C. Round Interval Pair Scheduling

From the inner for loop of Algorithm 2 in Section IV-A, we can see that the interval update in each cube is divided into a transferring phase and an updating phase. During the transferring phase, a cube receives an interval from another cube. Since the intervals of different cubes are disjoint, this procedure can be parallelized as follows. The 16 intervals in our GraphH implementation are organized as eight disjoint interval pairs. The two intervals in a pair send their local interval data to each other's cube, so the updating of eight interval pairs can be executed in parallel. This operation is denoted as one round. A complete outer for loop iteration in Algorithm 2 needs 16 such rounds. The pair configuration is updated in each round so that every cube has been paired with every other cube exactly once when one iteration finishes.

Fig. 6(a) shows an example solution of interval pairs in 16 rounds. Round 0 refers to the round in which each interval updates its own value. Note that this is not the only possible pair configuration. Based on this scheduling scheme, an interconnection reconfiguration scheme is a prerequisite for implementing such a design. We adopt the dynamic interconnection scheme instead of a static one. The host processor controls the switch statuses of all meta-connections (Fig. 3). GraphH can easily configure the interconnections in all 16 rounds according to a prestored LUT. The targets of our interconnection scheme are: 1) implementing the RIP scheduling scheme and 2) avoiding two or more cube pairs using the same meta-connection to transfer data. Under the double-mesh structure, Fig. 6(b) shows the interconnection implementation of the example solution in Fig. 6(a). Eight individual connections are configured to build up direct data paths for the interval pairs (labeled in eight different colors) in every round except round 0. In this way, conflicts among cubes are eliminated because a cube is updated by only one cube at a time, using an exclusive data path with 480-GB/s bandwidth.
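For reference, a valid set of interval pairs for all rounds can be produced with a standard round-robin (circle-method) schedule; the sketch below is one such generator and is not claimed to reproduce the exact table in Fig. 6.

def rip_rounds(num_cubes=16):
    # Round 0: every cube updates its own interval.
    rounds = [[(c, c) for c in range(num_cubes)]]
    # Rounds 1..15: circle-method round-robin; every cube meets every other
    # cube exactly once, with eight disjoint pairs per round.
    others = list(range(1, num_cubes))
    for _ in range(num_cubes - 1):
        ring = [0] + others
        rounds.append([(ring[i], ring[num_cubes - 1 - i]) for i in range(num_cubes // 2)])
        others = others[-1:] + others[:-1]    # rotate every cube except cube 0
    return rounds

schedule = rip_rounds()
assert len(schedule) == 16 and all(len(r) == 8 for r in schedule[1:])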


Fig. 6. Scheduling and connection of 16 cubes in different rounds of an updating iteration. (a) Interval pairs in 16 rounds. (b) Reconfigurable connection based on the double mesh for the interval pairs in 16 rounds.

Algorithm 3 Pseudocode of Processing in GraphH
Input: G = (V, E), initialization condition
Output: Updated V

1: HMC_Allocate(V) & HMC_Initialize(V)
2: while (finished = false) do
3:     for each round do
4:         for each interval pair <I_x, I_y> (all interval pairs in parallel) do
5:             transfer I_y to Cube_x (update I_y in parallel)
6:             for each edge e ∈ a vault in Cube_x (all vaults in parallel) do
7:                 value(e_dst) = HMC_Update(e_dst, value(e_src))
8:             end for
9:             HMC_Intracube_Barrier(I_x)
10:        end for
11:        HMC_Intercube_Barrier()
12:    end for
13:    finished = HMC_Check(V)
14: end while

D. Programming Interface

Algorithm 3 shows the execution flow of GraphH. Only HMC_Initialize() and HMC_Update() need to be defined by users in a high-level language (a sketch follows the list below).

1) HMC_Allocate(): The host processor allocates graph data to vaults and cubes according to IMIB.

2) HMC_Initialize(): This function initializes the values of the graph data.

3) HMC_Update(): This function defines how a source vertex updates a destination vertex through an edge. HMC_Update() can be considered a particular case of the Update() function in Algorithm 1.

TABLE II
REDUCE SYNCHRONIZATION OVERHEAD

4) HMC_Intracube_Barrier(): This is the synchronization function within a cube. GraphH synchronizes the 32 vaults to ensure that all vaults have finished updating.

5) HMC_Intercube_Barrier(): This is the synchronization function for all interval pairs.

6) HMC_Check(): This function checks whether the termination condition is satisfied.
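As an illustration of the interface, the two user-defined functions for CC (connected components via label propagation on a symmetrized graph, as in Section II-A) might look as follows; the exact prototypes of HMC_Initialize() and HMC_Update() are not given in the text, so the Python signatures below are our own assumption.

def HMC_Initialize(vertex_index):
    # CC: every vertex starts labeled with its own index.
    return vertex_index

def HMC_Update(dst_value, src_value):
    # Propagate the smaller component label along the edge; vertices that
    # converge to the same label belong to the same connected component.
    return min(dst_value, src_value)

BFS and PR follow the same pattern with different value types and update rules.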

V. GRAPHH OPTIMIZATION

A. Reduce Synchronization Overhead

As described in Algorithm 3, all cubes need to be synchronized 16 times in an iteration (once per Updating Phase and Transferring Phase). Such synchronization operations suffer from the unbalanced workloads of different cubes, because the execution time of processing a block in a cube depends on the graph dataset. However, the time of transferring data among cubes is easy to balance, because each interval has the same size in GraphH. Thus, instead of receiving interval data from other cubes in each round, a vault can store the values of all vertices, so that all cubes can perform the updating simultaneously without transferring data among cubes.

Fig. 7 shows the difference between the original scheduling scheme and the scheduling after reducing synchronization overhead. In the Updating Phase, each cube does not need to communicate with others before it finishes updating. After finishing processing with all vertex values, all vaults receive the updated values of all vertices from the other vaults in the Transferring Phase. In this phase, we can still use the RIP scheduling scheme shown in Fig. 6; each cube only receives updated vertex values without updating. Table II compares the synchronization times and the space requirement in a vault for the two scheduling methods. Because all vertices need to be stored in a vault, some large graphs may not be able to adopt this method due to the memory space limitation of a vault. We solve this issue in Section V-B.


Fig. 7. Difference in optimized GraphH. Data flow of (a) GraphH and (b) optimized GraphH.

B. Reuse Data in the Logic Layer

In order to update different destination vertices in a cube in parallel, all vaults need to store a complete copy of the source interval. Such an implementation has two drawbacks.

1) The memory space of a vault is limited (8 GB ÷ 32 = 256 MB [16]), which may not be sufficient to store all intervals under our first optimization method (Section V-A).

2) All source vertices need to be loaded from the memory layers to the logic layer 32 times (once by each vault).

To overcome the two disadvantages mentioned above, we adopt four 8×8 crossbar switches followed by eight 4×4 crossbar switches running at 1 GHz (the same as Graphicionado [2]) with standard virtual output queues [23] to share the source vertex values among the 32 vaults in GraphH. Source vertices from any vault can be sent to the destination interval buffers in any vault in this way. Assuming that we use 8 bytes to store an edge, the maximum throughput of the DRAM in a vault is 10 GB/s ÷ 8 bytes = 1.25 edges/ns, so the throughput of the DRAMs matches that of the crossbar switches. Moreover, because we do not need to modify the values of source vertices under our processing model, all crossbar switches can be pipelined without suffering from hazards (no forwarding units need to be adopted in the design). In this way, instead of duplicating intervals among the vaults in a cube, GraphH can share source vertices among vaults. This data reuse architecture is shown in Fig. 8.

Fig. 8. Crossbar switches to reuse vertices in the logic layer.

VI. EXPERIMENTAL EVALUATION

In this section, we first introduce the simulation setup of our GraphH design, followed by the workloads used in the experiments, including the graphs and the algorithms.

A. Simulation Setup

All experiments are based on our in-house simulator. Trace files of graph data access patterns are first generated using DynamoRIO [24]. Then, we apply these traces to timing models generated by Cacti 6.5 [18] and DRAMsim2 [25] to obtain the execution time. The overhead of reconfiguring the double-mesh connection by the host is taken into consideration in the simulator, as detailed in Section III-C. We run all our experiments on a personal computer equipped with a hexa-core Intel i7 CPU running at 3.3 GHz. The bandwidth of each vault is set to 10 GB/s according to the HMC 2.1 specification (16 GB/s in Tesseract). On the logic layer of a cube, we implement eight 4-MB shared source vertex buffers running at 4 GHz and 32 1-MB individual destination vertex buffers running at 2 GHz to perform the write-after-read operation. We use an ARM Cortex-A5 with FPU (without cache) running at 1 GHz as a demo of the in-order core.

B. Workloads

1) Algorithms: We implement three different graph algorithms. BFS calculates the shortest path from a given root vertex to all other vertices in the graph. PageRank (PR) evaluates the importance of all websites in a network according to the importance of their neighbor websites. Connected components (CC) detects all connected subgraphs in an arbitrary graph. The number of iterations for PR is set to 10 in our simulation, while for the other two algorithms the number of iterations depends on the graph data, so we simulate them to convergence.

2) Graphs: Both natural graphs and synthetic graphs are used in our experiments. We use four natural graphs: an Internet topology graph, as-skitter (AS), from traceroutes run daily in 2005; live-journal (LJ) from the LiveJournal network; twitter-2010 (TW) from the Twitter social network; and yahoo-web (YH) from the Yahoo network, which consists of billions of vertices and edges. We also use a set of synthetic graphs, delaunay_n20 to delaunay_n24, to evaluate the scalability of GraphH. The properties of these graph benchmarks are shown in Table III.²

²Indexes of vertices have been compressed; thus, the number of vertices may not equal the largest vertex index or the number of vertices reported in other papers.


TABLE III
PROPERTIES OF BENCHMARKS

Fig. 9. Speedup using OVB.

C. Benefits of GraphH Designs

Compared with Tesseract [1], four techniques from the architecture (OVB and RDMC) and algorithm (IMIB and RIP) perspectives are introduced in GraphH. To show the performance improvement obtained by adopting these designs, we simulate the performance difference with and without each of these techniques in this section. To control for a single variable, we adopt all other techniques when simulating the influence of one technique. The two optimization methods in Section V are also adopted in this section.

1) Benefits of OVB: GraphH adopts on-chip vertex buffers to avoid the random access pattern to the DRAM layers. We compare the performance of implementing OVBs in the logic layer with the performance of directly accessing the DRAM layers. The size of the source/destination interval buffers is set to 1 MB per vault. Techniques like crossbar switches and shared source interval buffers are adopted.

As shown in Fig. 9, by implementing OVBs in the logic layer, GraphH achieves a 4.58× speedup on average compared with directly accessing the DRAM layers.

2) Benefits of RDMC and RIP: GraphH adopts RDMC to maximize the interconnection bandwidth between two cubes. Moreover, to avoid conflicts among cubes, the RIP scheduling scheme is implemented on top of RDMC; RDMC and RIP work together in GraphH (results are shown in Fig. 10). We compare the performance using RDMC+RIP with a static single-mesh connection network under an RIP-like routing scheme. The physical bandwidth of each connection in the static mesh network is also set to 480 GB/s, but each cube can only use a quarter of that bandwidth because only one out of its four external links is connected to one meta-connection.

Fig. 10. Speedup using RDMC/RIP.

Fig. 11. Speedup using the IMIB partitioning method.

Using the RDMC+RIP-based connection and scheduling scheme, GraphH achieves a 1.29× speedup on average compared with the static connection method.

3) Benefits of IMIB: The workloads of different cubes in a round can be balanced using our IMIB partitioning method. We compare the performance of IMIB with the chunk-based method [4] (dividing vertices with consecutive indexes into a partition). These two partitioning methods introduce the least preprocessing overhead (scanning all edges without other linear algebra operations).

As shown in Fig. 11, IMIB achieves a 3.05× speedup on average compared with chunk-based partitioning. This conclusion is in contrast to the one in Gemini [4], which concludes that hash-based partitioning leads to more network traffic and worse performance. The reason is that the interconnection bandwidth provided by HMCs in GraphH is two orders of magnitude higher than that in Gemini, so balancing workloads is more important in this situation. We discuss the influence of network bandwidth on the whole-system performance in Section VI-E3.


Fig. 12. Performance comparison among the DDR-based system, GraphH under different configurations, and Tesseract (normalized to the DDR-based system; bars of DDR are thus omitted).

D. Performance

Based on the experimental results in Section VI-C, both the architecture (OVB and RDMC) and the algorithm (IMIB and RIP) techniques in GraphH lead to performance benefits. We implement these techniques and evaluate the performance of systems under different configurations.

1) DDR-Based System: We run software graph processing code on a physical machine. The CPU is an i7-5820K and the bandwidth of the DDR memory is 12.8 GB/s.

2) Original-GraphH System: This is the GraphH system without the optimization methods.

3) Opt1-GraphH System: We reduce synchronization overhead using the optimization method in Section V-A.

4) Opt2-GraphH System: We reuse vertex data in the logic layer using the optimization method in Section V-B.

5) Optimized-GraphH System: Based on the two optimization methods in Section V, this is the combination of the Opt1- and Opt2-GraphH systems.

6) Tesseract System: We make some modifications to GraphH to simulate Tesseract [1], including: a) the size of the prefetcher in Tesseract is set to 16 KB; b) we assume no global access latency, but the external bandwidth is set to 160 GB/s (each cube is linked to three cubes in Tesseract); and c) workloads are balanced in Tesseract. In this way, we simulate the performance upper bound of Tesseract.

1) Performance Comparison: We compare the performance of GraphH under different configurations with both the DDR-based and the Tesseract systems. The comparison result is shown in Fig. 12. As we can see, both Tesseract and GraphH outperform the DDR-based system by one to two orders of magnitude. GraphH achieves a 1.16×–5.12× (2.85× on average) speedup compared with Tesseract, because GraphH introduces both architecture and algorithm designs to solve the challenges in graph processing.

Reducing cube synchronization overhead makes a limited contribution to GraphH performance, because the workloads of different cubes have already been balanced using IMIB. Moreover, this implementation needs to store the values of all vertices in a vault, so it may not apply to larger graphs (e.g., for PR/CC on YH, there is no bar for original and opt1). Reusing on-chip vertex data (opt2/optimized) leads to a 2.28× average performance improvement compared with the original/opt1 configurations, because this implementation leads to less data transfer between the OVBs and the DRAM layers (detailed in Section VI-D2).

Fig. 13. Execution time breakdown when running PR (processing: processing edges in a cube; loading vertices: transferring vertices between OVBs and DRAM layers; transferring: transferring data among cubes).

2) Execution Time Breakdown: Fig. 13 shows the execution time breakdown in GraphH. Note that we assume the memory space of one vault is enough to store the required data; thus, for larger graphs like YH, we can still obtain results for the original/opt1 configurations.

As we can see, loading vertices (including writing back updated vertex data) accounts for 73.99% of the total execution time under the original/opt1 configurations. By adopting shared source interval memories, GraphH significantly reduces the time of transferring vertices between OVBs and DRAM layers to 50.63% for AS, LJ, and TW. For larger graphs like YH, transferring vertices between OVBs and DRAM layers still accounts for over 86.21% of the total execution time. A larger on-chip SRAM can relieve this bottleneck to some extent, but GraphH can only provide a 1-MB source vertex buffer per vault due to the area limitation.

3) Scalability: PIM achieves memory-capacity-proportional bandwidth, so it scales well to large-scale problems. We compare the scalability of GraphH with Tesseract. We run the PR algorithm on five synthetic graphs, delaunay_n20 to delaunay_n24 [26]. We use million traversed edges per second as the metric to evaluate the scalability of system performance.


Fig. 14. Scalability running PR on synthetic graphs.

Fig. 15. Power density and its distribution in GraphH when running PR on real-world graphs.

Fig. 14 shows the scalability of both GraphH and Tesseract. As we can see, both GraphH and Tesseract scale well to larger graphs. GraphH provides 2.31× the throughput of Tesseract on average.

4) Power Density: We analyze the power density of GraphH in Fig. 15 using Cacti 6.5 [18] and a previous HMC model [15]. The energy consumption of hardware support like OVBs is included in this figure. The logic layer accounts for 58.86% of the power in GraphH. By introducing extra hardware support for graph processing on HMCs, GraphH consumes more power than Tesseract (94 mW/mm²), but the power density is still under the thermal constraint (133 mW/mm²) [27] mentioned in Tesseract [1].

E. Design Space Exploration

1) Different OVB Sizes: We choose 1 MB for both the source and destination interval buffers per vault in our GraphH implementation. We compare the performance using different OVB sizes in Fig. 16, running the PR algorithm on four graphs. Note that the size refers to the source interval buffer size per vault, because the shared source interval buffer is adopted. Considering that one goal of the HMC is to have a small footprint, we also add the performance of adopting no OVB in Fig. 16. The execution time is normalized to Tesseract.

Fig. 16. Speedup using different OVB sizes (normalized to Tesseract).

Fig. 17. Performance on different array sizes.

As we can see in Fig. 16, for small graphs like AS and LJ, SRAMs of 0.25 MB per vault are enough to store all vertices on the logic layer. However, for larger graphs, smaller OVBs lead to significant performance degradation because more data are transferred between the OVBs and the DRAM layers. In some situations (e.g., YH), small OVBs cannot bring benefits compared with the no-OVB configuration and Tesseract. Nevertheless, GraphH can still achieve a 1.40× speedup over Tesseract even without OVBs.

2) Different Cube Array Sizes: We study the influence of the number of cubes on performance. We run PR on different array sizes: a 2×2 array (four cubes), a 2×4 array (eight cubes), a 4×8 array (32 cubes), and the current 16-cube design. The scheduling schemes are analogous [e.g., we use cubes 1–8 to perform rounds 0–7 in Fig. 6(b) as the scheduling scheme for eight cubes]. Thirty-two cubes need a triple-mesh structure to avoid conflicts among cubes.

Fig. 17 shows the performance comparison of different array sizes. We normalize the speedup to the performance of the 2×2 array. The axis on the left (light green bars) shows that using more cubes leads to better absolute performance. However, when we normalize the speedup to the number of cubes (right axis, dark green line), we find that the speedup per cube is not monotonically increasing. For small graphs like AS and LJ, an OVB of 0.25 MB per vault is enough to store all vertices, and using more cubes may lead to imbalance issues. For larger graphs like YH, using more cubes provides larger total OVB capacity and thus leads to better performance (normalized to the number of cubes). Simply adding cubes to GraphH improves absolute performance, but it can also cause inefficient utilization of each cube, depending on the size of the graph.


Fig. 18. Proportion of the time spent transferring data among cubes in the total execution time, as the total external bandwidth of a cube varies.

3) Different Network Bandwidth: As mentioned at the end of Section VI-C3, other distributed graph processing systems may also adopt the IMIB partitioning method. However, whether the method works depends on the network bandwidth. Compared with distributed graph processing systems like Gemini [4], the network bandwidth in GraphH is two orders of magnitude higher. In Fig. 18, we depict the proportion of the time spent transferring data among cubes in the total execution time, as the total external bandwidth of a cube varies.

As we can see in Fig. 18, if we lower the interconnection bandwidth to tens of gigabytes per second (the same as Gemini [4]), the network traffic accounts for up to 80.79% of the total execution time; transferring data among cubes under IMIB would dominate the total execution time in that situation. In GraphH, due to the high bandwidth among cubes provided by our designs, the network traffic accounts for less than 20% of the total execution time, so we balance workloads rather than reduce network traffic in GraphH.

VII. RELATED WORK

A. Processing-in-Memory and 3-D-Stacking

The concept of PIM has been proposed since the 1990s [28]. The original motivation of PIM is to reduce the cost of moving data and overcome the memory wall. Although adopting caches or improving the off-chip bandwidth are also solutions to such problems, PIM has its own advantages, such as a low data fetch cost. One key point of PIM devices is to place massive computation units inside memory dies with high integration density, and 3-D-stacking technology turns this into reality. In 3-D-stacking technology, silicon dies are stacked and interconnected using TSVs in a package. Intel presented its first 3-D-stacked Pentium 4 processor in 2004, and since then 3-D-stacking technology has attracted growing attention. Several works have used 3-D-stacking PIM architectures to accelerate data-intensive applications, such as graph processing [1], [3] and neural computation [12], among others [29], [30].

B. Large-Scale Graph Processing Systems

Many large-scale graph processing systems have been developed in recent years. These systems run on different platforms, including distributed systems [4], [5], [7], [11], single-machine systems [8], [14], [31], heterogeneous systems [32]–[35], and others [2]. Some distributed systems are built on big-data processing frameworks [36], [37] and focus on fault tolerance to provide a stable system. Other distributed graph processing systems address issues such as graph partitioning and real-time requirements. Single-machine systems aim to provide an efficient system under limited resources (e.g., a personal computer). Heterogeneous systems use devices such as GPUs [32] and FPGAs [33]–[35] to accelerate graph computation, though the capacity and bandwidth of these systems may be limited.

C. Large-Scale Graph Processing on Hybrid Memory Cubes

Tesseract [1] and GraphPIM [3] are two PIM-based graph processing architectures built on HMCs. As the first to adopt PIM for graph processing, Tesseract [1] is efficient and scalable for problems with intense memory bandwidth demands. To further exploit the potential of PIM for graph processing, GraphH proposes graph-specific designs, including hardware support, a balancing method, and a reconfigurable network, which are not discussed in Tesseract. GraphPIM [3] offloads instructions in graph processing to HMC devices. Compared with GraphH and Tesseract, GraphPIM does not introduce extra logic in the logic layer of HMCs. However, without an inter-cube interconnection design, the HMC in GraphPIM merely serves as a substitute for conventional memory in a graph processing system. In contrast, GraphH and Tesseract focus on the scalability of using multiple cubes and on offloading operations of the whole graph processing flow to HMCs.

VIII. CONCLUSION

In this paper, we analyze four crucial factors for improving the performance of large-scale graph processing. To provide higher bandwidth, we implement an HMC-array-based graph processing system, GraphH, adopting the concept of PIM. Hardware specializations such as OVBs are integrated into GraphH to avoid random data access. Cubes are connected using the RDMC to provide high global bandwidth and ensure locality. We divide large graphs into partitions and then map them to the HMCs. We balance the workloads of partitions using IMIB, and conflicts among cubes are avoided using the RIP scheduling method. We further propose two optimization methods that reduce global synchronization overhead and reuse on-chip data to improve the performance of GraphH. According to our experimental results, GraphH scales to large graphs, outperforms DDR-based graph processing systems by up to two orders of magnitude, and achieves up to 5.12× speedup compared with Tesseract [1].

REFERENCES

[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proc. ISCA, Portland, OR, USA, 2015, pp. 105–117.

[2] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in Proc. MICRO, 2016, pp. 1–13.

[3] L. Nai et al., “GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks,” in Proc. HPCA, Austin, TX, USA, 2017, pp. 457–468.

[4] X. Zhu, W. Chen, W. Zheng, and X. Ma, “Gemini: A computation-centric distributed graph processing system,” in Proc. OSDI, Savannah, GA, USA, 2016, pp. 301–316.

[5] Y. Low et al., “Distributed GraphLab: A framework for machine learning and data mining in the cloud,” VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.

[6] M. M. Ozdal et al., “Energy efficient architecture for graph analytics accelerators,” in Proc. ISCA, 2016, pp. 166–177.

[7] G. Malewicz et al., “Pregel: A system for large-scale graph processing,” in Proc. SIGMOD, 2010, pp. 135–146.

[8] Y. Chi et al., “NXgraph: An efficient graph processing system on a single machine,” in Proc. ICDE, 2016, pp. 409–420.

[9] L. Ma, Z. Yang, H. Chen, J. Xue, and Y. Dai, “Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication,” in Proc. ATC, 2017, pp. 195–207.

[10] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient parallel graph exploration on multi-core CPU and GPU,” in Proc. PACT, Galveston, TX, USA, 2011, pp. 78–88.

[11] K. Zhang, R. Chen, and H. Chen, “NUMA-aware graph-structured analytics,” ACM SIGPLAN Notices, vol. 50, no. 8, pp. 183–193, 2015.

[12] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proc. ASPLOS, 2017, pp. 751–764.

[13] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory,” in Proc. ISCA, 2016, pp. 380–392.

[14] A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-Stream: Edge-centric graph processing using streaming partitions,” in Proc. SOSP, Farmington, PA, USA, 2013, pp. 472–488.

[15] J. Jeddeloh and B. Keeth, “Hybrid memory cube new DRAM architecture increases density and performance,” in Proc. VLSIT, Honolulu, HI, USA, 2012, pp. 87–88.

[16] HMC Consortium. (2014). Hybrid Memory Cube Specification 2.1. [Online]. Available: http://hybridmemorycube.org/files/SiteDownloads/HMC-30G-VSRHMCCSpecificationRev2.120151105.pdf

[17] M. G. Smith and S. Emanuel, “Methods of making thru-connections in semiconductor wafers,” U.S. Patent 3,343,256, Sep. 1967.

[18] HP Labs. (1994). CACTI. [Online]. Available: http://www.hpl.hp.com/research/cacti/

[19] G. Kim, J. Kim, J. H. Ahn, and J. Kim, “Memory-centric system interconnect design with hybrid memory cubes,” in Proc. PACT, 2013, pp. 145–156.

[20] C. Cakir, R. Ho, J. Lexau, and K. Mai, “Modeling and design of high-radix on-chip crossbar switches,” in Proc. NOCS, 2015, p. 20.

[21] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the Internet topology,” ACM SIGCOMM Comput. Commun. Rev., vol. 29, no. 4, pp. 251–262, 1999.

[22] G. Karypis and V. Kumar, “METIS–unstructured graph partitioning and sparse matrix ordering system, version 2.0,” 1995.

[23] Y. Tamir and G. L. Frazier, “High-performance multi-queue buffers for VLSI communications switches,” ACM SIGARCH Comput. Archit. News, vol. 16, no. 2, May 1988.

[24] (2002). DynamoRIO. [Online]. Available: http://www.dynamorio.org/

[25] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” IEEE Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, Jan./Jun. 2011.

[26] (2011). 10th DIMACS Implementation Challenge. [Online]. Available: http://www.cc.gatech.edu/dimacs10/index.shtml

[27] Y. Eckert, N. Jayasena, and G. H. Loh, “Thermal feasibility of die-stacked processing in memory,” 2014.

[28] M. Gokhale, B. Holmes, and K. Iobst, “Processing in memory: The Terasys massively parallel PIM array,” Computer, vol. 28, no. 4, pp. 23–31, 1995.

[29] Q. Zhu et al., “A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing,” in Proc. 3DIC, San Francisco, CA, USA, 2013, pp. 1–7.

[30] N. Mirzadeh, O. Koçberber, B. Falsafi, and B. Grot, “Sort vs. hash join revisited for near-memory execution,” in Proc. ASBD, 2015.

[31] X. Zhu, W. Han, and W. Chen, “GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in Proc. ATC, 2015, pp. 375–386.

[32] F. Khorasani, R. Gupta, and L. N. Bhuyan, “Scalable SIMD-efficient graph processing on GPUs,” in Proc. PACT, San Francisco, CA, USA, 2015, pp. 39–50.

[33] G. Dai et al., “ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture,” in Proc. FPGA, 2017, pp. 217–226.

[34] G. Dai, Y. Chi, Y. Wang, and H. Yang, “FPGP: Graph processing framework on FPGA – a case study of breadth-first search,” in Proc. FPGA, 2016, pp. 105–110.

[35] E. Nurvitadhi et al., “GraphGen: An FPGA framework for vertex-centric graph computation,” in Proc. FCCM, Boston, MA, USA, 2014, pp. 25–28.

[36] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[37] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in Proc. HotCloud, Boston, MA, USA, 2010, pp. 1–7.

Guohao Dai (S’18) received the B.S. degree from Tsinghua University, Beijing, China, in 2014, where he is currently pursuing the Ph.D. degree with the Department of Electronic Engineering.

He has been with the Nanoscale Integrated Circuits and Systems Laboratory, Department of Electronic Engineering, Tsinghua University, since 2014. His current research interests include acceleration of large-scale graph processing on hardware and emerging devices.

Tianhao Huang is currently pursuing the senior undergraduate degree in electronic engineering with Tsinghua University, Beijing, China.

He has been with the Nanoscale Integrated Circuits and Systems Laboratory, Department of Electronic Engineering, since 2016. His current research interests include architecture support for efficient graph processing and parallel computing.

Yuze Chi received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, where he is currently pursuing the Ph.D. degree in computer science.

His current research interests include graph processing, image processing, and genomics.


Jishen Zhao (M’10) is an Assistant Professor with the Computer Science and Engineering Department, University of California at San Diego, La Jolla, CA, USA. Her current research interests include computer architecture and system software, with a particular emphasis on memory and storage systems, domain-specific acceleration, and high-performance computing.

Guangyu Sun (M’09) received the B.S. and M.S. degrees from Tsinghua University, Beijing, China, in 2003 and 2006, respectively, and the Ph.D. degree in computer science from Pennsylvania State University, State College, PA, USA, in 2011.

He is currently an Associate Professor with the Center for Energy-Efficient Computing and Applications, Peking University, Beijing. His current research interests include computer architecture, very large-scale integration design, and electronic design automation. He has published over 60 journal and refereed conference papers in the above areas.

Dr. Sun serves as an Associate Editor for ACM TECS and JETC. He has also served as a Peer Reviewer and a Technical Referee for several journals, including IEEE MICRO, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS. He is a member of ACM and CCF.

Yongpan Liu (M’07–SM’15) received the B.S., M.S., and Ph.D. degrees from the Electronic Engineering Department, Tsinghua University, Beijing, China, in 1999, 2002, and 2007, respectively.

He was a Visiting Scholar with Penn State University, State College, PA, USA, in 2014. He is a Key Member of the Tsinghua-Rohm Research Center and the Research Center of Future ICs. He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. His research is supported by NSFC, the 863 and 973 Programs, and industry companies such as Huawei, Rohm, and Intel. He has published over 60 peer-reviewed conference and journal papers and led over six chip design projects for sensing applications, including the first nonvolatile processor (THU1010N). His current research interests include nonvolatile computation, low-power very large-scale integration design, emerging circuits and systems, and design automation.

Dr. Liu was a recipient of the Design Contest Award from ISLPED 2012 and 2013, and the Best Paper Award from HPCA 2015. He has served on several conference technical program committees, such as DAC, ASP-DAC, ISLPED, A-SSCC, ICCD, and VLSI-D. He is an ACM Member and an IEICE Member.

Yu Wang (S’05–M’07–SM’14) received the B.S. and Ph.D. (Hons.) degrees from Tsinghua University, Beijing, China, in 2002 and 2007, respectively.

He is currently a tenured Associate Professor with the Department of Electronic Engineering, Tsinghua University. He is the Co-Founder of Deephi Tech, Beijing, a leading deep learning processing platform provider. He has published over 50 journal papers (38 in IEEE/ACM journals) and over 150 conference papers. He has given 70 invited talks and two tutorials in industry/academia. His current research interests include brain-inspired computing, application-specific hardware computing, parallel circuit analysis, and power/reliability-aware system design methodology.

Dr. Wang was a recipient of the Best Paper Award at ISVLSI 2012 and FPGA 2017, the Best Poster Award at HEART 2012 with eight best paper nominations (DAC 2017, ASPDAC 2016, ASPDAC 2014, ASPDAC 2012, two in ASPDAC 2010, ISLPED 2009, and CODES 2009), the IBM X10 Faculty Award in 2010, and the Natural Science Fund for Outstanding Youth Fund in 2016. He currently serves as the Co-Editor-in-Chief for the ACM SIGDA E-Newsletter, an Associate Editor for the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, and the Journal of Circuits, Systems, and Computers. He also serves as a Guest Editor for Integration, the VLSI Journal and the IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS. He served as the TPC Chair for ISVLSI 2018 and ICFPT 2011, the Finance Chair of ISLPED 2012–2016, the Track Chair for DATE 2017 and 2018 and GLSVLSI 2018, and a Program Committee Member for leading conferences in these areas, including top electronic design automation conferences such as DAC, DATE, ICCAD, and ASP-DAC, and top FPGA conferences such as FPGA and FPT. He is currently with the ACM Distinguished Speaker Program. He is an ACM Senior Member.

Yuan Xie (SM’07–F’15) was with IBM, Armonk, NY, USA, from 2002 to 2003, and the AMD Research China Laboratory, Beijing, China, from 2012 to 2013. He was a Professor with Pennsylvania State University, State College, PA, USA, from 2003. He is currently a Professor with the Electrical and Computer Engineering Department, University of California at Santa Barbara, Santa Barbara, CA, USA. He has been inducted into the ISCA/MICRO/HPCA Hall of Fame. His current research interests include computer architecture, electronic design automation, and very large-scale integration design.

Huazhong Yang (M’97–SM’00) was born in Ziyang, China, in 1967. He received the B.S. degree in microelectronics and the M.S. and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1989, 1993, and 1998, respectively.

In 1993, he joined the Department of Electronic Engineering, Tsinghua University, where he has been a Full Professor since 1998. He has authored and co-authored over 400 technical papers, seven books, and over 100 granted Chinese patents. His current research interests include wireless sensor networks, data converters, energy-harvesting circuits, nonvolatile processors, and brain-inspired computing.

Dr. Yang was a recipient of the Distinguished Young Researcher award from NSFC in 2000 and the Cheung Kong Scholar award from the Ministry of Education of China in 2012. He has been in charge of several projects, including projects sponsored by the National Science and Technology Major Project, the 863 Program, NSFC, the 9th Five-Year National Program, and several international research projects. He has also served as the Chair of the Northern China ACM SIGDA Chapter.