ECE 408 (ΗΜΥ 408): DIGITAL DESIGN WITH FPGAs

Winter Semester 2018, LECTURE 8: FPGAs for the Masses

ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ (Theocharides), Associate Professor, ECE

([email protected])

Slides adapted from: Dr. Christoforos Kachris and Prof. Dimitrios Soudris (ICCS/NTUA, Greece), and Profs. Walid Najjar (UC Riverside, USA) and Paolo Ienne (EPFL, Switzerland)

ECE 408 (ΗΜΥ408), Lecture 8: FPGAs for the Masses ©Theocharides, ECE, 2018

Accelerators in data centers

By 2020, Intel predicts a third of cloud providers will use FPGAs, analysts noted in a keynote at their annual data center event…


Data Center Requirements

Traffic requirements increase significantly in the data centers but the power budget remains the same (Source: ITRS, HiPEAC, Cisco)

[Figure: Traffic growth in data centers versus power constraints, 2012 to 2019. Plotted series: traffic growth, heat load per rack, power per chip, transistor count. Vertical axis ticks: 1, 10.]


Hardware accelerators

• HW acceleration can significantly reduce the execution time and energy consumption of several applications (10x-100x)

[Source: Xilinx, 2016]


Google: Application-Specific Accelerators Deployed in Data Centers

Google Has Built A Custom Chip For Machine Learning

The result is called a Tensor Processing Unit (TPU), a custom ASIC we built specifically for machine learning — and tailored for TensorFlow.

Google has been running TPUs inside its data centers for more than a year and has found them to deliver an order of magnitude better-optimized performance per watt for machine learning.

This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore’s Law).
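A quick sanity check of the arithmetic behind this claim, as a sketch under one assumption that is not stated on the slide: each Moore's Law generation roughly doubles performance per watt.

```python
import math

# Assumption (not from the slide): one generation doubles performance per watt.
GAIN = 10.0          # "an order of magnitude" better performance per watt
YEARS = 7.0          # "about seven years into the future"
GENERATIONS = 3.0    # "three generations of Moore's Law"

# Number of doublings needed to reach a 10x gain.
doublings_needed = math.log2(GAIN)

# Implied length of one generation according to the quote.
years_per_generation = YEARS / GENERATIONS

print(f"doublings for 10x: {doublings_needed:.2f}")
print(f"years per generation: {years_per_generation:.2f}")
```

Under that assumption, 10x needs a little over three doublings, and the quote implies a generation of a bit more than two years, so the two figures in the quote are mutually consistent.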

https://www.tensorflow.org/


A survey of HW accelerators for cloud computing

HW accelerators
• Search engine and page ranking
• Spark
• Memcached
• Databases

FPGAs in the cloud framework

FPL 2016, Christoforos Kachris,


Web search and Page Ranking

MS Catapult (Bing web search engine): 95% higher throughput per server or, while maintaining equivalent throughput, tail latency reduced by 29%.


Spark Accelerator

J. Cong, M. Huang, D. Wu, and C. H. Yu, “Invited – heterogeneous datacenters: Options and opportunities,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: ACM, 2016, pp. 16:1–16:6

When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration

Deploying Accelerators At Datacenter Scale Using Spark, Spark Summit


Memcached accelerator

M. Blott, L. Liu, K. Karras, and K. Vissers, “Scaling out to a single-node 80gbps memcached server with 40terabytes of memory,” in Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems, ser. HotStorage’15. Berkeley, CA, USA: USENIX Association, 2015

36x improvement in RPS/Watt (requests per second per watt) with low variation


In-memory Databases

Source: [B. Sukhwani, H. Min, M. Thoennes, P. Dube, B. Brezzo, S. Asaad, and D. E. Dillenberger, “Database analytics: A reconfigurable-computing approach,” IEEE Micro, vol. 34, no. 1, pp. 19–29, Jan 2014.]

7x to 14x speedup for most queries


Where is the Parallelism?

- Multiple tiles process DB pages in parallel
- Concurrently evaluate multiple records from a page within a tile
- Concurrently evaluate multiple predicates against different columns within a row
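The three levels of parallelism listed above can be mimicked in software. This is only an illustrative sketch: the function names, the dictionary row format, and the use of a thread pool to stand in for hardware tiles are all assumptions, not the design from the cited paper.

```python
from concurrent.futures import ThreadPoolExecutor

def row_matches(row, predicates):
    """Level 3: every predicate tests a different column of the same row.
    In hardware these comparisons would run side by side; here they are
    evaluated one after another."""
    return all(test(row[column]) for column, test in predicates)

def scan_page(page, predicates):
    """Level 2: all records of one page are handled inside one 'tile'."""
    return [row for row in page if row_matches(row, predicates)]

def scan_table(pages, predicates, tiles=4):
    """Level 1: several tiles (worker threads here) each take a page."""
    with ThreadPoolExecutor(max_workers=tiles) as pool:
        per_page = pool.map(lambda page: scan_page(page, predicates), pages)
        # pool.map yields results in page order.
        return [row for rows in per_page for row in rows]

# Hypothetical data: two pages of two records each.
pages = [
    [{"age": 41, "city": "Nicosia"}, {"age": 25, "city": "Nicosia"}],
    [{"age": 52, "city": "Athens"}, {"age": 37, "city": "Nicosia"}],
]
predicates = [("age", lambda v: v > 30), ("city", lambda v: v == "Nicosia")]
hits = scan_table(pages, predicates)
```

Here `hits` contains the two records that satisfy both predicates, in page order.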


SQL Databases

Baidu has recently presented FPGA-based acceleration of SQL databases for data centers

[Source: Jian Ouyang, Baidu, Hot Chips 2016]


A survey of HW accelerators for cloud computing

HW accelerators
• Search engine and page ranking
• MapReduce
• Spark
• Memcached
• Databases

FPGAs in the cloud framework


IBM’s OpenPower IP Store


Intel’s vision on IP Store


RC3E, Dresden University

Source: [O. Knodel and R. G. Spallek, “RC3E: provision and management of reconfigurable hardware accelerators in a cloud environment,” in 2nd International Workshop on FPGAs for Software Programmers, 2015]


FPGAs in HyperScale Data Centers

[Figure: FPGAs in hyperscale data centers. Labeled components: cloud tenants; cloud computing applications; cloud orchestrator (resource manager, scheduler, acceleration controllers); heterogeneous data center (Proc., Proc.+GPUs, Proc.+FPGAs); IP Acc/App store; 3rd-party IP developers; library of hardware accelerators as IP blocks.]

The ecosystem of Hardware IPs in the embedded system world can be adopted in the data centers.

Accelerator IPs can foster innovation in the domain of cloud computing and big data analytics


Scaling Reverse Time Migration Performance Through Reconfigurable Dataflow Engines

Haohan Fu¹, Lin Gan¹, Robert G. Clapp², Huabin Ruan¹, Oliver Pell³, Oskar Mencer³, Michael Flynn², Xiaomeng Huang¹, and Guangwen Yang¹

¹Tsinghua University ²Stanford University ³Maxeler Technologies

A Real Example!


Migration (Geology)

https://upload.wikimedia.org/wikipedia/commons/3/38/GraphicalMigration.jpg




Reverse Time Migration (RTM)

• Imaging algorithm
• Used for oil and gas exploration
• Computationally demanding


RTM Pseudocode

Iterations over shots (sources) are independent and easy to parallelize. For each shot:

• Forward pass: iterate over time steps and 3D grids
- Propagate source wave fields from time 0 to nt - 1
- Add the recorded source signal to the corresponding location
- Apply boundary conditions

• Backward pass: iterate over time steps and 3D grids
- Propagate receiver wave fields from time nt - 1 to 0
- Add the recorded receiver signal to the corresponding location
- Apply boundary conditions
- Cross-correlate the source and receiver wave fields at the same time step to accumulate the result
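The loop structure of the pseudocode can be sketched as a toy example. This is not the implementation from the paper: it assumes a 1D grid instead of 3D, a second-order leapfrog update, fixed zero edges as the boundary condition, and it keeps every source snapshot in memory. All function and parameter names are hypothetical.

```python
def step(prev, cur, vel2):
    """One leapfrog time step of the 1D wave equation.
    vel2 is (velocity * dt / dx) squared; the two edge points stay at
    zero, which serves as a crude boundary condition."""
    n = len(cur)
    nxt = [0.0] * n
    for i in range(1, n - 1):
        laplacian = cur[i - 1] - 2.0 * cur[i] + cur[i + 1]
        nxt[i] = 2.0 * cur[i] - prev[i] + vel2 * laplacian
    return nxt

def rtm_shot(source, recorded, src_pos, rcv_pos, n, vel2=0.25):
    """Image contribution of one shot on an n-point grid."""
    nt = len(source)

    # Forward pass: propagate the source wave field from time 0 to nt - 1.
    prev, cur = [0.0] * n, [0.0] * n
    snapshots = []
    for t in range(nt):
        nxt = step(prev, cur, vel2)
        nxt[src_pos] += source[t]          # add the source signal
        snapshots.append(nxt)
        prev, cur = cur, nxt

    # Backward pass: propagate the receiver wave field from nt - 1 to 0.
    image = [0.0] * n
    prev, cur = [0.0] * n, [0.0] * n
    for t in range(nt - 1, -1, -1):
        nxt = step(prev, cur, vel2)
        nxt[rcv_pos] += recorded[t]        # add the recorded receiver signal
        prev, cur = cur, nxt
        # Cross-correlate both wave fields at the same time step.
        for i in range(n):
            image[i] += snapshots[t][i] * cur[i]
    return image

def rtm(shots, n):
    """Shots are independent, so their images are simply summed."""
    image = [0.0] * n
    for source, recorded, src_pos, rcv_pos in shots:
        partial = rtm_shot(source, recorded, src_pos, rcv_pos, n)
        image = [a + b for a, b in zip(image, partial)]
    return image
```

Because each call to `rtm_shot` touches no shared state, the loop in `rtm` is the natural place to distribute work across devices.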


RTM Computational Challenges

Cross-correlate source and receiver signals

Source/receiver wave signals are computed in different directions in time

The size of a source wave field for one time-step can be 0.5 to 4 GB

Checkpointing: store the source wave field at certain time steps and recompute the remaining steps when needed
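A minimal sketch of the checkpointing idea, assuming a generic `advance` function that moves the wave-field state forward by one time step. The function names and the `every` interval are hypothetical and are not taken from the paper.

```python
def forward_with_checkpoints(initial, advance, nt, every):
    """Run time steps 0 .. nt - 1 and keep only every `every`-th state."""
    checkpoints = {0: initial}
    state = initial
    for t in range(1, nt):
        state = advance(state)
        if t % every == 0:
            checkpoints[t] = state
    return checkpoints

def state_at(t, checkpoints, advance, every):
    """Rebuild the state at time step t from the nearest earlier checkpoint."""
    base = (t // every) * every
    state = checkpoints[base]
    for _ in range(t - base):      # recompute at most every - 1 steps
        state = advance(state)
    return state
```

The trade-off is memory against recomputation: with an interval of `every`, storage drops by roughly that factor, while each lookup costs up to `every - 1` extra time steps.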

Memory access pattern
- Neighboring points may be distant in the memory space
- High cache miss rate (when the domain is large)
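To see why grid neighbors can be far apart in memory, consider a flat array layout. The sketch below assumes the x index varies fastest and 4-byte (32-bit) values; the grid size in the example is hypothetical.

```python
def flat_index(x, y, z, nx, ny):
    """Position of grid point (x, y, z) in a flat array, x varying fastest."""
    return x + nx * (y + ny * z)

def neighbor_strides(nx, ny, bytes_per_point=4):
    """Distance in bytes between a point and its next neighbor on each axis."""
    return {
        "x": bytes_per_point,
        "y": nx * bytes_per_point,
        "z": nx * ny * bytes_per_point,
    }

# Hypothetical 512 x 512 cross-section.
strides = neighbor_strides(512, 512)
```

For this layout the z neighbor of a point sits `512 * 512 * 4` bytes away (1 MiB), so a stencil that reaches several points along z touches addresses that are unlikely to share a cache line.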


Hardware


General Architecture


Performance Tuning

Optimization strategies
- Algorithmic requirements
- Hardware resource limits

Balance resource utilization so that none becomes a bottleneck
- LUTs
- DSP blocks
- Block RAMs
- I/O bandwidth


Custom BRAM Buffers

37 pt. Star Stencil on a MAX3 DFE

• 24 concurrent pipelines at 125 MHz

• Concurrent access to 37 points per cycle

• Internal memory bandwidth of 426 Gbytes/sec
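A back-of-envelope check of these numbers, as a sketch. Two things are assumed rather than taken from the slide: a 37-point star in 3D is read as radius 6 on each axis (1 + 2 x 3 x 6 = 37), and each point is taken to be 4 bytes wide.

```python
def star_offsets(radius, dims=3):
    """Offsets of a star stencil: the center plus `radius` points in both
    directions along each axis."""
    offsets = [(0,) * dims]
    for axis in range(dims):
        for distance in range(1, radius + 1):
            for sign in (-1, 1):
                offset = [0] * dims
                offset[axis] = sign * distance
                offsets.append(tuple(offset))
    return offsets

POINTS = len(star_offsets(6))      # 37 points for radius 6 in 3D
PIPELINES = 24                     # from the slide
CLOCK_HZ = 125_000_000             # 125 MHz, from the slide
BYTES_PER_POINT = 4                # assumption: 32-bit values

reads_per_second = PIPELINES * POINTS * CLOCK_HZ
bytes_per_second = reads_per_second * BYTES_PER_POINT
print(f"{bytes_per_second / 1e9:.0f} GB/s under these assumptions")
```

Under these assumptions the estimate is 444 GB/s, the same order as the 426 Gbytes/sec quoted above; the gap presumably comes from details the slide does not state, such as the exact word widths.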


More Parallelism

Process multiple points concurrently
- Demands more I/O

Cascade multiple time steps in a deep pipeline
- Demands more buffers


Number Representation

32-bit floating-point was the default

Convert many variables to 24-bit fixed-point
- Smaller pipelines => MORE pipelines

Floating-point: 16,943 LUTs, 23,735 flip-flops, 24 DSP48Es

Fixed-point: 3,385 LUTs, 3,718 flip-flops, 12 DSP48Es
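The sketch below illustrates both halves of this slide: what a 24-bit fixed-point conversion looks like, and how much smaller the fixed-point version is. The split into 16 fractional bits is an assumption chosen for illustration; the slide does not give the format, nor does it say which unit the resource counts refer to.

```python
WORD_BITS = 24  # total width of the signed fixed-point word

def to_fixed(x, frac_bits=16):
    """Quantize x to a signed 24-bit integer with `frac_bits` fractional
    bits (assumed split), saturating at the ends of the range."""
    raw = round(x * (1 << frac_bits))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    return max(lowest, min(highest, raw))

def from_fixed(raw, frac_bits=16):
    """Convert the fixed-point integer back to a float."""
    return raw / (1 << frac_bits)

# Resource counts copied from the slide.
FLOAT_COST = {"LUT": 16943, "FF": 23735, "DSP48E": 24}
FIXED_COST = {"LUT": 3385, "FF": 3718, "DSP48E": 12}

# How many times smaller the fixed-point version is, per resource type.
ratios = {name: FLOAT_COST[name] / FIXED_COST[name] for name in FLOAT_COST}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x fewer")
```

The LUT and flip-flop savings are about 5x and 6.4x, while DSP usage only halves, so in a design built entirely from such units the DSP blocks would become the limiting resource first.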


Results


Roadmap

Paradigm shift (from homogeneous data centers to heterogeneous data centers)

3rd-party hardware IP developers contribute to a common marketplace for hardware accelerators, in the same way as in embedded systems
