Energy Efficient Architecture for Graph Analytics Accelerators ISCA’16 Mustafa Ozdal * , Serif Yesil * , Taemin Kim , Andrey Ayupov , John Greth , Steven M. Burns , Ozcan Ozturk * * Bilkent University, Ankara, Turkey Intel Corporation, Oregon, USA


Page 1: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016  · Energy Efficient Architecture for Graph

Energy Efficient Architecture for Graph Analytics Accelerators

ISCA’16

Mustafa Ozdal*, Serif Yesil*, Taemin Kim†, Andrey Ayupov†, John Greth†, Steven M. Burns†, Ozcan Ozturk*

* Bilkent University, Ankara, Turkey

† Intel Corporation, Oregon, USA


Motivation

Dark silicon era

Accelerator rich architectures: Customized hardware for specific applications

Hardware design is complex and time consuming

Many applications. Which ones to accelerate? Months of design effort.

Template based design: Capture commonalities for a domain


Graph Analytics

Model relationships between individual entities

Emerging application areas: Social networks, web, recommender systems, …

Example applications: PageRank, Collaborative Filtering, Loopy Belief Propagation, Betweenness Centrality, …

Graph-level parallelism & iterative algorithms


Graph Accelerator Template

Targeted Graph Computation Pattern: Vertex-centric & Gather-Apply-Scatter (GAS)

We propose: Energy efficient accelerator architecture for irregular graph applications

Well-defined template to plug in different applications

Synthesizable SystemC models for architecture exploration & hardware generation

Design Productivity & Efficiency: Template code size: 39K lines; user code size: 43 lines for PageRank

PageRank: 65X better power efficiency than 24 cores of Xeon CPU


Outline

Targeted Application Characteristics

Graph-Parallel Abstraction

Proposed Architecture

Experimental Results


Graph Analytics

Different than traditional HPC

Irregular data access & communication

Poor cache locality

Computation-to-communication ratio very low

Irregular topologies due to scale-free graphs

Convergent algorithms

Throughput vs. work-efficiency

Different implementation choices

High throughput easier to achieve than work efficiency


Asymmetric Convergence

Processing all vertices in every iteration is not work-efficient!


PageRank Execution

7% converge in 1 iteration

51% converge in 36 iterations

99.7% converge in 50 iterations

100% converge in 77 iterations

A similar observation was made in: Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed GraphLab: A framework for machine learning and data mining in the cloud,” Proc. VLDB Endow., vol. 5, pp. 716-727, 2012.

about 2x more edges processed for PageRank!


Synchronous vs. Asynchronous Execution


Jacobi iteration formula for PageRank:

$$r^{k+1}(v) = \frac{1-\alpha}{N} + \alpha \sum_{(u \to v)} \frac{r^{k}(u)}{\mathrm{degree}(u)}$$

Synchronous: All vertices are updated simultaneously.

Gauss-Seidel iteration formula for PageRank:

$$r^{k+1}(v) = \frac{1-\alpha}{N} + \alpha \sum_{u<v,\ (u \to v)} \frac{r^{k+1}(u)}{\mathrm{degree}(u)} + \alpha \sum_{u>v,\ (u \to v)} \frac{r^{k}(u)}{\mathrm{degree}(u)}$$

Asynchronous: Updates to a vertex are visible to others in the same iteration. Observed to be much faster to converge! (30-50% less work)
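The difference between the two formulas can be sketched in a few lines of Python. This is only an illustrative software model, not the accelerator's implementation: the function name `pagerank_sweeps`, the 4-vertex example graph, the damping factor 0.85, and the tolerance are all assumptions made for the example. The only change between the two modes is whether a sweep reads a frozen copy of the previous ranks (Jacobi) or the rank array being updated in place (Gauss-Seidel).

```python
def pagerank_sweeps(in_edges, out_degree, alpha=0.85, tol=1e-10, asynchronous=False):
    """Return (ranks, sweeps) for Jacobi (synchronous) or Gauss-Seidel (asynchronous) PageRank."""
    n = len(in_edges)
    r = [1.0 / n] * n
    sweeps = 0
    while True:
        sweeps += 1
        # Jacobi reads a frozen copy of the previous iterate; Gauss-Seidel reads
        # r itself, so updates made earlier in this sweep are already visible.
        src = r if asynchronous else list(r)
        delta = 0.0
        for v in range(n):
            new = (1 - alpha) / n + alpha * sum(src[u] / out_degree[u] for u in in_edges[v])
            delta = max(delta, abs(new - r[v]))
            r[v] = new
        if delta < tol:
            return r, sweeps

# Hypothetical example graph with edges 0->1, 1->2, 2->0, 2->3, 3->0.
in_edges = [[2, 3], [0], [1], [2]]
out_degree = [1, 1, 2, 1]

sync_ranks, sync_sweeps = pagerank_sweeps(in_edges, out_degree, asynchronous=False)
async_ranks, async_sweeps = pagerank_sweeps(in_edges, out_degree, asynchronous=True)
print("Jacobi sweeps:", sync_sweeps, "Gauss-Seidel sweeps:", async_sweeps)
```

Both modes converge to the same fixed point; the sweep counts printed at the end give a rough feel for the convergence difference on a given graph, though a toy graph says nothing about the 30-50% figure reported above.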


Throughput vs. Work Efficiency

Asymmetric Convergence

Process all vertices: Easier to implement. High throughput. Worse work efficiency.

Process active vertices only: Maintain worklist, dynamic work assignment. Lower throughput. Better work efficiency.

Iterative Execution Model

Synchronous: Easier to implement. High throughput. Worse work efficiency.

Asynchronous: Fine-grain synchronization, sequential consistency support. Lower throughput. Better work efficiency.

Ozdal et al., ICCAD 2015
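The "process active vertices only" option can be illustrated with a small worklist-driven PageRank in Python. This is a sketch under stated assumptions rather than the design from the talk: the function name `pagerank_active`, the example graph, the damping factor, and the activation threshold `tol` are hypothetical. A vertex is re-queued only when one of its in-neighbors changed by more than `tol`, and the function counts how many edges it reads so the work can be compared against full sweeps.

```python
from collections import deque

def pagerank_active(in_edges, out_edges, alpha=0.85, tol=1e-10):
    """Asynchronous PageRank that only processes vertices on the worklist."""
    n = len(in_edges)
    r = [1.0 / n] * n
    worklist = deque(range(n))   # every vertex starts active
    queued = [True] * n          # avoids duplicate worklist entries
    edges_read = 0
    while worklist:
        v = worklist.popleft()
        queued[v] = False
        edges_read += len(in_edges[v])
        new = (1 - alpha) / n + alpha * sum(r[u] / len(out_edges[u]) for u in in_edges[v])
        changed = abs(new - r[v]) > tol
        r[v] = new
        if changed:
            # Only a significant change re-activates the out-neighbors.
            for w in out_edges[v]:
                if not queued[w]:
                    queued[w] = True
                    worklist.append(w)
    return r, edges_read

# Hypothetical example graph with edges 0->1, 1->2, 2->0, 2->3, 3->0.
in_edges = [[2, 3], [0], [1], [2]]
out_edges = [[1], [2], [0, 3], [0]]
ranks, edges_read = pagerank_active(in_edges, out_edges)
print("edges read:", edges_read)
```

The bookkeeping in `queued` and the dynamic re-activation are the software analogue of the extra cost listed above: the worklist has to be maintained, but converged vertices stop consuming edge reads.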


Gather-Apply-Scatter Abstraction

Abstraction proposed by GraphLab for distributed computing (Low et al., VLDB 2012)

Data structures associated with each vertex and edge

Compute operations defined for 3 stages of a vertex program: GATHER, APPLY, SCATTER
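As a rough illustration of the three stages, PageRank can be written as a vertex program in Python. Everything named here (`VertexData`, `pr_gather`, `pr_apply`, `pr_scatter`, `run_gas`, the constants, and the example graph) is hypothetical and chosen for the example; it does not reproduce the GraphLab API or the template's SystemC interface. The point is that the application supplies only the three small functions, while the generic driver handles traversal and activation.

```python
from dataclasses import dataclass

ALPHA = 0.85   # damping factor (assumed value)
TOL = 1e-10    # activation threshold (assumed value)

@dataclass
class VertexData:
    rank: float
    out_degree: int

def pr_gather(neighbor: VertexData) -> float:
    """GATHER: contribution of one in-neighbor."""
    return neighbor.rank / neighbor.out_degree

def pr_apply(vertex: VertexData, gathered: float, num_vertices: int) -> bool:
    """APPLY: update the vertex from the accumulated gather result."""
    new_rank = (1 - ALPHA) / num_vertices + ALPHA * gathered
    changed = abs(new_rank - vertex.rank) > TOL
    vertex.rank = new_rank
    return changed

def pr_scatter(changed: bool) -> bool:
    """SCATTER: decide whether the out-neighbors must be activated."""
    return changed

def run_gas(vertices, in_edges, out_edges):
    """Generic driver: runs the vertex program until no vertex is active."""
    n = len(vertices)
    active = set(range(n))
    while active:
        v = active.pop()
        gathered = sum(pr_gather(vertices[u]) for u in in_edges[v])
        changed = pr_apply(vertices[v], gathered, n)
        if pr_scatter(changed):
            active.update(out_edges[v])
    return [vd.rank for vd in vertices]

# Hypothetical example graph with edges 0->1, 1->2, 2->0, 2->3, 3->0.
in_edges = [[2, 3], [0], [1], [2]]
out_edges = [[1], [2], [0, 3], [0]]
vertices = [VertexData(rank=0.25, out_degree=len(out_edges[v])) for v in range(4)]
ranks = run_gas(vertices, in_edges, out_edges)
print(ranks)
```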


ACCELERATOR UNIT

Active List Mgr: Maintains active vertices

Runtime: Schedules vertex computation

Gather Unit: Accumulates data from neighbors for a vertex

Apply Unit: Performs main computation for a vertex using gather results

Scatter Unit: Distributes the new data to neighbors; activates neighbors

Memory modules: Customized per graph data type


Compute Units

Gather Unit

Neighbor vertices and edges accessed. Poor cache locality!

Latency tolerant: Tens of vertices and hundreds of edges processed concurrently. High MLP!

Storage for partial vertex and edge states with dynamic load balancing

Dependency between neighboring vertices handled through Sync Unit

Apply Unit

Computation done on local data only

Scatter Unit

Similar to Gather Unit

Memory writes in addition to reads

Neighbor vertex activations


Control Units

Sync Unit

Ensures race-free and sequentially-consistent execution of vertices

Maintains execution states of vertices and assigns a rank for each vertex

Guarantees the proper RAW and WAR ordering for neighboring vertices

High-throughput processing

Active List Manager

Active vertices stored in main memory with efficient caching

High-throughput access mechanisms

Race-free simultaneous accesses without explicit locks

Coordinates with Sync Unit for asynchronous execution


Multiple Accelerator Units

Banked design: Each unit responsible for a static subset of vertices

Two global light-weight modules:

GTD: Global Termination Detector

GRC: Global Rank Counter
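The banked organization can be pictured with a toy Python model. The slides do not say how vertices are assigned to units or how the GTD works internally, so everything here is an assumption made for illustration: the class name `BankedActiveLists`, the modulo mapping in `owner`, and the rule that execution has terminated once every unit's active list is empty.

```python
from collections import deque

class BankedActiveLists:
    """Toy model: one active list per accelerator unit, with a static vertex-to-unit map."""

    def __init__(self, num_units):
        self.num_units = num_units
        self.lists = [deque() for _ in range(num_units)]

    def owner(self, v):
        # Assumed static mapping; the actual assignment is not given in the slides.
        return v % self.num_units

    def activate(self, v):
        # An activation is routed to the unit that owns the vertex.
        q = self.lists[self.owner(v)]
        if v not in q:
            q.append(v)

    def terminated(self):
        # Assumed termination rule: no unit has any active vertex left.
        return all(not q for q in self.lists)

banks = BankedActiveLists(num_units=4)
for v in range(10):
    banks.activate(v)
print([list(q) for q in banks.lists])
```

A static mapping of this kind means an activation never needs a lookup to find its unit, which is one plausible reason the global modules can stay light-weight.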


Benchmarks

Applications

PageRank (PR)

Single Source Shortest Path (SSSP)

Stochastic Gradient Descent (SGD)

Loopy Belief Propagation (LBP)

Datasets

PR & SSSP: 6 datasets from SNAP and generated with Graph500 (up to 1B edges)

LBP: 3 images generated with GraphLab’s synthetic image generator (up to 18M edges)

SGD: 2 movie datasets from MovieLens (up to 10M edges)


Experimental Setup

Baseline CPU

2-socket 24-core IvyBridge Xeon with 30MB LLC and 132GB of main memory

Optimized software implementations in OpenMP/C++

Running Average Power Limit (RAPL) to estimate energy

DDR3 power (measured) projected to DDR4 power (in-house DDR4 model)

Proposed Accelerator

Performance: Cycle-accurate SystemC model + DRAMSim2

Accelerator power and area: HLS + physical-aware logic synthesis with a 22nm industrial library

Cache power and area: CACTI models

DRAM power: in-house DDR4 model


Performance Comparison

[Figure: Accelerator Speed Up (axis range 0 to 5); series: 24-cores, 12-cores]


Power Comparison

Accelerator power is dominated by DRAM power. Improvements would be ~10x higher without DRAM power

[Figure: CPU Power / ACC Power (axis range 0 to 70); series: 24-cores, 12-cores]


Conclusions

A template architecture for graph analytics is proposed

Latency tolerance for irregular accesses

Graph-parallel execution with sequential consistency

Asynchronous execution and active vertex set support

Synthesizable and cycle-accurate SystemC models

Different accelerators generated by plugging in app-specific functions

Template code size: 39K lines; user code size: 43 lines for PageRank

Experiments with 22nm industrial libraries:

Performance comparable with a 24-core Xeon system (except SSSP)

Up to 65x less power
