the glass half fullweb.fdi.ucm.es/posgrado/conferencias/zsoltistvan-slides.pdf · x100 [cidr05]...
TRANSCRIPT
-
The Glass Half Full Using Programmable Hardware Accelerators in
Analytical Databases
Zsolt IstvánIMDEA Software Institute
1
-
IMDEA Software Institute
• 16 Faculty in the areas of:• Program Analysis and Verification• Languages and Compilers• Security and Privacy• Theoretical Computer Science• Distributed Systems and Databases
• ~10 Post-docs, ~25 PhD Students, ~10 Interns
• Located in UPM Montegancedo Campus, Madrid
• We are hiring! https://software.imdea.org/
-
▪ OLAP – Online Analytical Processing▪ Large datasets – up to TBs
▪ Ad-hoc querying to extract insight, recurring reporting – Possibly complex operations
▪ Read-mostly workloads, updates in batches
▪ OLTP – Online Transaction Processing▪ Smaller datasets
▪ Queries known, relate to business actions
▪ Makes heavy use of indexes
▪ Reads and updates intermixed
3
Context: Analytical Databases
-
4
Databases were a 25 Billion $ market in 2018…
Could we specialize machines to them?
https://www.statista.com/statistics/810188/worldwide-commercial-database-market-size/
-
▪ Fully custom machine for databases▪ Processors – special ISA microprocessors
▪ Memory – magnetic bubbles and CCDs
▪ Semiconductor technology and general purpose CPUs took over
5
Database Computer – ’70s
“The first goal is to design it with the
capability of handling a very large on-line
database of 10^10 bytes or beyond since
special-purpose machines are not likely to
be cost-effective for small databases.”
Jayanta Banerjee, David K. Hsiao, Krishnamurthi Kannan: DBC - A Database Computer for Very Large Databases.
IEEE Trans. Computers 28(6): 414-429 (1979)
-
▪ Based on VAX multi-processor system
▪ By the time the software and hardware were developed, CPUs have become much faster ▪ Couldn’t keep up with
Moore’s law
6
Gamma Machine – ’80s
David J. DeWitt, Robert H. Gerber, Goetz Graefe, Michael L. Heytens, Krishna B. Kumar, M. Muralikrishna: GAMMA - A High
Performance Dataflow Database Machine. VLDB 1986: 228-237
-
7
Data/Compute Gap
CPU Scaling Commodity in CloudSpecialized
Hardware
Revival
-
8
Renewed interest in Specialized Hardware
ASICsFPGAsCPUs
-
Field Programmable Gate Array (FPGA)
▪ Free choice of architecture
▪ Fine-grained pipelining, communication, distributed memory
▪ Tradeoff: all “code” occupies chip space
▪ Evolving platform: larger chips, more heterogeneity
9
Re-programmable Specialized Hardware
Op 1
Op 2
Op 3
-
10
Integration Options
Accel.
1) On the side 2) In data-path 3) Co-processor
Data
Data Data
Accel.
Accel.
-
▪ Accelerator▪ Amazon F1
▪ In data path▪ Microsoft Catapult
▪ Co-processor▪ Intel Xeon+FPGA
11
In the Cloud Today
Socket1 Socket2
CPU FPGA
Socket1
CPU FPGACPU
FPGA
Intel Xeon+FPGA Gen.1 Intel Xeon+FPGA Gen.2
-
12
The Glass Half Empty…
-
▪ 1) On the side acceleration introduces overhead
▪ Many related work offers no real speedup if we factor in data movement, transformation, software overhead… 13
The Glass Half Empty…
0
20
40
60
80
100
120
Software With Acceleration
Query execution time
Compute Data Movement
2xAccel.Data
-
▪ 2) “All or nothing” behavior makes query planning difficult▪ Example: fixed capacity hash table on FPGA
▪ Constant time access for reads and writes
▪ What happens if data doesn’t fit?
▪ Can’t always know the number of keys aprioi
14
The Glass Half Empty…
#
-
▪ 3) Analytical databases becoming more optimized / not much compute in core SQL
▪ X100 [CIDR05] showed that
-
▪ On the side acceleration introduces overhead
▪ “All or nothing” behavior makes query planning difficult
▪ Analytical databases becoming more optimized / not much compute in core SQL
16
The Glass Half Empty…
-
▪ On the side acceleration introduces overhead
✓ Reduce data movement bottlenecks
17
The Glass Half Full…
-
▪ IBEX: Database storage engine with processing offload▪ Filter and pre-aggregate for analytic workloads
18
Processing in data path: Smart Flash
Database Server
IBEX
SSD
IBEX – An Intelligent Storage Engine with Support for Advanced SQL Off-loading. L. Woods, Z. Istvan
and G. Alonso, VLDB’14
→ Larger bandwidth, more IOPS (Samsung YourSQL, MIT BlueDBM)
▪ Opportunity to extend SSDs/Flash with complex offload
Samsung “smart” SSD
-
19
Processing in data path: Distributed Processing
Wo
rke
rs (
Com
pu
te)
Sto
rag
e
+ Provisioning
+ Scalability
Caribou: Distributed
storage with processing
• Specialized HW nodes
• 10Gbps access
• 25W power cons.
Zsolt István, David Sidler, Gustavo Alonso: Caribou: Intelligent Distributed Storage. PVLDB 10(11), 2017.
-
20
Smart Storage in Databases: Filter push-down
Intel Hyperscan library (Xeon E5-2680 v2)
2.8x
SELECT … FROM customer
WHERE age2
AND address LIKE “%PO. Box 123%”
▪ Challenge: guarantee that filtering never slows down retrieval
▪ Algorithms can be re-imagined to become bandwidth-bound instead of compute-bound▪ Extend the state of the art: parameterization without re-programming [FCCM16]
▪ Many options: Regular expressions, comparisons, decompression, …
[FCCM16] Runtime Parameterizable Regular Expression Operators for Databases. Zs. Istvan, D. Sidler, G. Alonso. FCCM’16
-
✓ Reduce data movement bottlenecks
▪ “All or nothing” behavior makes query planning difficult
✓ Hybrid processing
21
The Glass Half Full…
-
▪ Group-by: Compute aggregate function over categories▪ select avg(salary) from employees group by department
22
IBEX’s Hybrid Group-by
CPUIbex with SW-only Group-By
Projection Selection Group-byFinal
Groups
Input tableFiltered
data
-
▪ Group-by: Compute aggregate function over categories▪ select avg(salary) from employees group by department
23
IBEX’s Hybrid Group-by
CPUIbex with HW-only Group-By
Projection Selection Group-byFinal
Groups
Input tableFiltered
data
-
CPUIbex with HW-only Group-By
Projection Selection Group-byFinal
Groups
Input tableFiltered
data
▪ Group-by: Compute aggregate function over categories▪ select avg(salary) from employees group by department
▪ If number of groups does not fit on FPGA? ▪ Send partial aggregates – finalize in SW
▪ Worst case: same as no acceleration
▪ Best- case: All in HW!
24
IBEX’s Hybrid Group-by
CPUIbex with Hybrid Group-by
Input table Projection Selection Group-by Group-byFinal
Groups
Filtered data
Partial Group
s
Challenge: How to split across accelerator and software?
-
✓ Reduce data movement bottlenecks
✓ Hybrid Processing
▪ Analytical databases becoming more optimized / not much compute in core SQL
✓ Emerging compute-intensive workloads
25
The Glass Half Full
-
▪ Databases adopting new ways of analyzing the data▪ SAP Hana, Oracle, SQL Server, etc.
▪ Specialized hardware can help both with model building [Kara18], inference [Owaida18]
▪ Benefits for “classical” algorithms as well
[Kara18] Kara et al: ColumnML: Column-Store Machine Learning with On-The-Fly Data Transformation. PVLDB 12(4): 348-361 (2018)
[Owaida18] Owaida et al: Application Partitioning on FPGA Clusters: Inference over Decision Tree Ensembles. FPL 2018: 295-30026
The Rise of Machine Learning
-
CPUFPGA Co-processor
27
doppioDB: a hybrid database engine
Database
Engine
(MonetDB)
Hardware
OperatorSoftware
operator
Software
operator
Hardware
Operator
Hardware
Operator
▪ Goal: extend the capabilities of analytical databases▪ FPGA works on the same data as software (cache-coherent access)
▪ Can combine SW and HW operators inside the same query
▪ Challenge: ensure high utilization of FPGA, use in many queries
DRAM (DB Tables)
No data copy,
transformation,
partitioning, etc.
Hardware
Operator
-
K-means – Algorithm
◼ Goal: partition unlabeled data into several
clusters, where the number of clusters is
the “k” in the k-means.
◼ Two steps in each iteration:
◼ Assignment: assign data points to
closet centroid according to distance
metric
◼ Centroid update: the centroids are re-
calculated by averaging all the data
points within each cluster
◼ Long process if the data set and number of
iterations are large
28
-
DRAM
(DB Tables)
Design – Execution Walk-Through
Receives K-Means parameters1
Fetch the initial centroids and
the data
2
3 Calculates the distance between a data point and all the centroids
and assign it to closest centroid
4 Accumulates data points per cluster and counts how many data points are assigned to
each cluster
Collect partial results from each pipeline5
Division for updating new centroid6
Writes back the final results7
1
2
3 4
56
7
Zhenhao He, David Sidler, Zsolt István, Gustavo Alonso: A Flexible K-Means Operator for Hybrid Databases. FPL 201829
-
30
Uses of Parallelism
K is known /
Centroids
known
Need to determine K
(Elbow method)
▪ K-Means algorithm▪ FPGA outperforms several cores of the CPU
▪ Can use parallelism in two ways – cover more queries
-
▪ Text: Regular expression matching, Edit distance, …
▪ Database ops.: Skyline queries, Group-by aggregations, …
▪ Statistics: Histograms, Count-min sketch, Bloom filters, …
▪ Machine learning: Clustering (K-means), Stochastic Gradient Descent, Decision Trees, …
▪ Data management: Hash tables, hash functions, …
▪ [Your algorithm here]
31
Wide range of algorithms can benefit from hardware
-
✓ Reduce data movement bottlenecks
✓ Hybrid Processing
✓ Emerging compute-intensive workloads
32
The Glass Half Full…
Future Challenges…
▪ Managing Programmable Hardware accelerators▪ Is this the job of the OS or does the DB has to take control?
▪ How to share programmable hardware across tenants
▪ Compilation/synthesis of hardware accelerators▪ Can we derive accelerators from user queries?
▪ Intermediary DSL or building blocks we could use?
For more details, see: The Glass Half Full: Using Programmable Hardware Accelerators in Analytics. Z. István. IEEE Data Engineering
Bulletin, March 2019.