ΗΜΥ 408 ΨΗΦΙΑΚΟΣ ΣΧΕΔΙΑΣΜΟΣ ΜΕ fpgas Χειμερινό ... ·...
TRANSCRIPT
ΗΜΥ 408 ΨΗΦΙΑΚΟΣ ΣΧΕΔΙΑΣΜΟΣ ΜΕ FPGAs
Χειμερινό Εξάμηνο 2018 ΔΙΑΛΕΞΗ 8: FPGAs for the Masses
ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ Αναπληρωτής Καθηγητής, ΗΜΜΥ
Slides adopted from: Dr. Christoforos Kachris and Prof. Dimitrios Soudris (ICCS/NTUA, Greece), and Profs. Walid Najjar (UC Riverside, USA) and Paolo Ienne (EPFL, SWITZERLAND)
ΗΜΥ408 Δ8 FPGAs for the Masses.2 ©Theocharides, ECE, 2018
Accelerators in data centers
By 2020, Intel predicts a third of cloud providers will use FPGAs, analysts noted in a keynote at their annual data center event…
ΗΜΥ408 Δ8 FPGAs for the Masses.3 ©Theocharides, ECE, 2018
Data Center Requirements
Traffic requirements increase significantly in the data centers but the power budget remains the same (Source: ITRS, HiPEAC, Cisco)
1
10
2012 2013 2014 2015 2016 2017 2018 2019
Traffic growth in Data centers versous Power constraints
Traffic growth
Heat load per rack
Power per chip
Transistor countTransistors
Traffic growth in Data Centers
Power per chip
Heat load per rack
ΗΜΥ408 Δ8 FPGAs for the Masses.4 ©Theocharides, ECE, 2018
Hardware accelerators
• HW acceleration can be used to reduce significantly the execution time and the energy consumption of several applications (10x-100x)
[Source: Xilinx, 2016]
ΗΜΥ408 Δ8 FPGAs for the Masses.5 ©Theocharides, ECE, 2018
Google application Specific Accelerators deployed in DC
Google Has Built A Custom Chip For Machine Learning
The result is called a Tensor Processing Unit (TPU), a custom ASIC we built specifically for machine learning — and tailored for TensorFlow.
Google has been running TPUs inside the data centers for more than a year, and have found them to deliver an order of magnitude better-optimized performance per watt for machine learning.
This is roughly equivalent to fast-forwarding technology about seven years into the future (three generations of Moore’s Law).
ΗΜΥ408 Δ8 FPGAs for the Masses.6 ©Theocharides, ECE, 2018
A survey on HW accelerator for Cloud computing
HW accelerators Search engine and Page ranking Spark Memcached Databases
FPGAs in the cloud framework
FPL 2016, Christoforos Kachris,
6
ΗΜΥ408 Δ8 FPGAs for the Masses.7 ©Theocharides, ECE, 2018
Web search and Page Ranking
MS Catapult: Bing web search
engine 95% higher
throughput per server
Or, (while maintaining equivalent throughput) Tail latency: reduced by 29%.
ΗΜΥ408 Δ8 FPGAs for the Masses.8 ©Theocharides, ECE, 2018
Spark Accelerator
J. Cong, M. Huang, D. Wu, and C. H. Yu, “Invited – heterogeneous datacenters: Options and opportunities,” in Proceedings of the 53rd Annual Design Automation Conference, ser. DAC ’16. New York, NY, USA: ACM, 2016, pp. 16:1–16:6
When Apache Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration Deploying Accelerators At Datacenter Scale Using Spark, Spark Summit
ΗΜΥ408 Δ8 FPGAs for the Masses.9 ©Theocharides, ECE, 2018
Memcached accelerator
M. Blott, L. Liu, K. Karras, and K. Vissers, “Scaling out to a single-node 80gbps memcached server with 40terabytes of memory,” in Proceedings of the 7th USENIX Conference on Hot Topics in Storage and File Systems, ser. HotStorage’15. Berkeley, CA, USA: USENIX Association, 2015
36x in RPS/Watt with low variation
ΗΜΥ408 Δ8 FPGAs for the Masses.10 ©Theocharides, ECE, 2018
In-memory Databases
Source: [B. Sukhwani, H. Min, M. Thoennes, P. Dube, B. Brezzo, S. Asaad, and D. E. Dillenberger, “Database analytics: A reconfigurable-computing approach,” IEEE Micro, vol. 34, no. 1, pp. 19–29, Jan 2014.]
7x to 14x speedup for most queries
ΗΜΥ408 Δ8 FPGAs for the Masses.11 ©Theocharides, ECE, 2018
Where is the Parallelism?
Multiple tiles process DB pages in parallel Concurrently evaluate multiple records from a page within a tile
- Concurrently evaluate multiple predicates against different columns within a row
ΗΜΥ408 Δ8 FPGAs for the Masses.12 ©Theocharides, ECE, 2018
SQL Databases
Baidu has recently presented an FPGA-based acceleration for data centers for the SQL databases
[Source: Jian Ouyang, Baidu, Hot Chips 2016]
ΗΜΥ408 Δ8 FPGAs for the Masses.13 ©Theocharides, ECE, 2018
A survey on HW accelerator for Cloud computing
HW accelerators Search engine and Page ranking MapReduce Spark Memcached Databases
FPGAs in the cloud framework
ΗΜΥ408 Δ8 FPGAs for the Masses.16 ©Theocharides, ECE, 2018
RC3E, Dresden University
Source: [O. Knodel and R. G. Spallek, “RC3E: provision and management of reconfigurable hardware accelerators in a cloud environment,” in 2nd International Workshop on FPGAs for Software Programmers, 2015]
ΗΜΥ408 Δ8 FPGAs for the Masses.17 ©Theocharides, ECE, 2018
FPGAs in HyperScale Data Centers
Cloud computing Applications
Cloud Orchestrator
3rd party IP developersLibrary of Hardware
accelerators as IP Blocks
• Resource Manager• Scheduler• Acceleration Controllers
Heterogeneous Data Center
Proc. Proc.+GPUs Proc.+FPGAs
IP Acc/App store
Cloud tenants
The ecosystem of Hardware IPs in the embedded system world can be adopted in the data centers.
Accelerators IPs can foster the innovation of IPs in the domain of cloud computing and big data analytics
ΗΜΥ408 Δ8 FPGAs for the Masses.18 ©Theocharides, ECE, 2018
Scaling Reverse Time Migration Performance Through Reconfigurable Dataflow Engines
Haohan Fu1, Lin Gan1, Robert G Clapp2, Huabin Ruan1, Oliver Pell3, Oskar Mencer3, Michael Flynn2,
Xiaomeng Huang1, and Guangwen Yang1
1Tsinghua University 2Stanford University 3Maxeler Technologies
A Real Example!
ΗΜΥ408 Δ8 FPGAs for the Masses.19 ©Theocharides, ECE, 2018
Migration (Geology)
https://upload.wikimedia.org/wikipedia/commons/3/38/GraphicalMigration.jpg
ΗΜΥ408 Δ8 FPGAs for the Masses.20 ©Theocharides, ECE, 2018
Reverse Time Migration (RTM) Imaging algorithm
Used for oil and gas exploration
Computationally demanding
ΗΜΥ408 Δ8 FPGAs for the Masses.21 ©Theocharides, ECE, 2018
RTM Pseudocode
Iterate over time-steps, and 3D grids
Iterations over shots (sources) are independent and easy to parallelize
Iterate over time-steps, and 3D grids
Propagate source wave fields from time 0 to nt - 1
Propagate receiver wave fields from time nt - 1 to 0
Cross-correlate the source and receiver wave field at the same time step to accumulate the result
Add the recorded source signal to the corresponding location
Add the recorded receiver signal to the corresponding location
Boundary conditions
Boundary conditions
ΗΜΥ408 Δ8 FPGAs for the Masses.22 ©Theocharides, ECE, 2018
RTM Computational Challenges Cross-correlate source and receiver signals
Source/receiver wave signals are computed in different directions in time
The size of a source wave field for one time-step can be 0.5 to 4 GB
Checkpointing: store source wave field and certain time steps and recompute the remaining steps when needed
Memory access pattern Neighboring points may be distant in the memory space High cache miss rate (when the domain is large)
ΗΜΥ408 Δ8 FPGAs for the Masses.25 ©Theocharides, ECE, 2018
Performance Tuning
Optimization strategies Algorithmic requirements Hardware resource limits
Balance resource utilization so that none becomes a bottleneck LUTs DSP Blocks block RAMs I/O bandwidth
ΗΜΥ408 Δ8 FPGAs for the Masses.26 ©Theocharides, ECE, 2018
Custom BRAM Buffers
37 pt. Star Stencil on a MAX3 DFE
• 24 concurrent pipelines at 125 MHz
• Concurrent access to 37 points per cycle
• Internal memory bandwidth of 426 Gbytes/sec
ΗΜΥ408 Δ8 FPGAs for the Masses.27 ©Theocharides, ECE, 2018
More Parallelism Process multiple points concurrently
Demands more I/O
Cascade multiple time steps in a deep pipeline Demands more buffers
ΗΜΥ408 Δ8 FPGAs for the Masses.28 ©Theocharides, ECE, 2018
Number Representation
32-bit floating-point was default
Convert many variables to 24-bit fixed-point Smaller pipelines => MORE pipelines
Floating-point - 16,943 LUTs - 23,735 flip-flops - 24 DSP48Es
Fixed-point - 3,385 LUTs - 3,718 flip-flops - 12 DSP48Es