overview on hardware optimizations for database engines

27
Overview on Hardware Optimizations for Database Engines Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner BTW 2017, Stuttgart, Germany, 2017-03-09

Upload: others

Post on 03-Apr-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Overview on Hardware Optimizations for Database Engines

Overview on Hardware Optimizations for Database EnginesAnnett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner

BTW 2017, Stuttgart, Germany, 2017-03-09

Page 2: Overview on Hardware Optimizations for Database Engines

2

Interaction DB-Engine and Hardware

Applications/Database Engines

Modern Hardware

Well-Known Challenge:Exploit hardware technology by specific data management techniques (indexing, data storage, query & transaction processing)

1970 1980 1990 2000 2010 2020

10

100

1000

10000

1e+05

1e+06

1e+07 memory (KByte)

1970 1980 1990 2000 2010 2020

0

2

4

6

8

10 #cores

Main Memory CPU

Page 3: Overview on Hardware Optimizations for Database Engines

3

Era of Dark Silicon

 MOORE‘S LAW

§ Number of transistors in a dense integrated circuit doubles approximately every two years.

 DARK SILICON

§ We can no longer power the transistors that Moore is giving us

1970 1980 1990 2000 2010 2020

110

1001000

100001e+051e+061e+07 #transistors (x1000)

process (nm)

http://engineering.nyu.edu/garg/node/31

Page 4: Overview on Hardware Optimizations for Database Engines

4

HW/SW Co-Design for DB-Engines

Applications/Database Engines

Modern Hardware

Challenge:HW/SW Co-Design for Database EnginesSpecialization of Hardware to overcome Dark Silicon

Page 5: Overview on Hardware Optimizations for Database Engines

5

Outline

 HARDWARE FOUNDATION

 EXTENSIONS FOR PROCESSING ELEMENTS

 INTELLIGENT DMA CONTROLLER

Page 6: Overview on Hardware Optimizations for Database Engines

6

Hardware Foundation

 TOMAHAWK PLATFORM

Page 7: Overview on Hardware Optimizations for Database Engines

7

Hardware Foundation – Zoom In

Page 8: Overview on Hardware Optimizations for Database Engines

8

Hardware Foundation – Zoom In (2)

 CORE MANAGER (CM)

§ Extended Xtensa-LX5 from Tensilica (now Cadence)

§ 32KB for code§ 64KB for data

 PROCESSING ELEMENTS (PE)

§ Xtensa-LX5 from Tensilica (now Cadence)§ 32KB for code

§ 2x32KB for data on PE

 APPLICATION CORE (APP)

§ 570T core from Tensilica (now Cadence)

Control-Plane

Control-Plane

Page 9: Overview on Hardware Optimizations for Database Engines

9

Outline

Control-Plane

Control-Plane

 PART I:

EXTENSIONS OF PROCESSING ELEMENTS

Page 10: Overview on Hardware Optimizations for Database Engines

10

Development Flow

DEVELOPMENT OF INSTRUCTION SET EXTENSIONS WITH

TENSILICA TOOLS

§ Tensilica Instruction Extension (TIE) language

§ C/TIE compiler§ Cycle accurate simulator/debugger

§ Processor generator

SYNTHESIS OF RTL CODE

§ Synopsys Design Compiler, PrimeTime PX

§ TSMC CMOS LP 65nm libraries

int res= (v0 + v1 + v2) >> shift8;

// shift8 -> internal stateint res=add3_shift(v0, v1, v2);

Page 11: Overview on Hardware Optimizations for Database Engines

11

Investigated Database Primitives

Bitmap Compression and Processing (AND,

OR, XOR)

Hashing Sorted Set Operations

WA

H

PLW

AH

CO

MP

AX

Has

h+

Lo

oku

p

Has

h +

Inse

rt

Has

h K

eys

Has

h S

amp

ling

Cit

yHas

h3

2

Me

rge

So

rt

Inte

rse

ctio

n

Un

ion

Dif

fere

nce

Sort

-Me

rge

Jo

in

Sort

-Me

rge

A

gg

reg

atio

n (

SUM

)

Primivites

2014

Page 12: Overview on Hardware Optimizations for Database Engines

12

Basic RISC Instruction Set

Application-Specific Instruction Set

Instruction Set

Application-Specific States

Application-Specific Registers

Basic Registers

Register Files

Instructionfetch

Load-Store Unit 0

Load-Store Unit 1

Data Prefetcher

Inte

rcon

ne

ct

Local InstructionMemory

Local Data Memory 0

Local Data Memory 1

Extended Tensilica LX5 Processor

64 bit

128 bit

128 bit

General Approach for all Extensions

Page 13: Overview on Hardware Optimizations for Database Engines

13

Bitmap Primitives

 BITMAPS ARE A SPECIAL KIND OF INDEX  BITMAPS COMPRESSION

§ bit length equals number of tuples

 WORD-ALIGNED HYBRID (WAH) CODE

§ Stateless compression§ Run-length-encoding (RLE)

- run of 0‘s and 1‘s

§ WAH bitmaps contain RLE- compressed fills and

- uncompressed literals

Compressing Bitmap Indexes for Faster Search Operations

Kesheng Wu, Ekow J. Otoo and Arie ShoshaniLawrence Berkeley National Laboratory

Berkeley, CA 94720, USAEmail: {kwu, ejotoo, ashoshani}@lbl.gov

Abstract

In this paper, we study the effects of compression onbitmap indexes. The main operations on the bitmaps dur-ing query processing are bitwise logical operations such asAND,OR,NOT, etc.Using the general purpose compres-sion schemes, such as gzip, the logical operations on thecompressed bitmaps are much slower than on the uncom-pressed bitmaps. Specialized compression schemes, likethe byte-aligned bitmap code (BBC), are usually fasterin performing logical operations than the general purposeschemes, but in many cases they are still orders of magni-tude slower than the uncompressed scheme. To make thecompressed bitmap indexes operate more efficiently, wedesigned a CPU-friendly scheme which we refer to as theword-aligned hybrid code (WAH). Tests on both syntheticand real application data show that the new scheme sig-nificantly outperforms well-known compression schemesat a modest increase in storage space. Compared to BBC,a scheme well-known for its operational efficiency, WAHperforms logical operations about 12 times faster and usesonly 60% more space. Compared to the uncompressedscheme, in most test cases WAH is faster while still usingless space. We further verified with additional tests thatthe improvement in logical operation speed translates tosimilar improvement in query processing speed.

1. Introduction

This research was originally motivated by the needto manage the volume of data produce by a high-energy experiment called STAR1 [25, 26]. In this ex-periment, information about each potentially interest-ing collision event is recorded and multi-terabyte (1012)of data is generated each year. One important way ofaccessing the data is to have the data management

1 Information about the project is also available athttp://www.star.bnl.gov/STAR.

bitmap indexOID X =0 =1 =2 =3

1 0 1 0 0 02 1 0 1 0 03 3 0 0 0 14 2 0 0 1 05 3 0 0 0 16 3 0 0 0 17 1 0 1 0 08 3 0 0 0 1

b1 b2 b3 b4

Figure 1. A sample bitmap index.

system retrieve the events satisfying some conditionsuch as “Energy > 15 GeV and 7 <= NumParticles

< 13” [5, 25]. The physicists have identified about 500attributes that are useful for this selection process anda typical condition may involve a handful of attributes.This type of queries are known as the partial rangequeries. Since the attributes are usually read not mod-ified, the characteristics of the dataset are very sim-ilar to those of commercial data warehouses. In datawarehouse applications, one of the best known index-ing strategies for processing the partial range queries isthe bitmap index [6, 8, 21, 30]. For this reason, we haveselected to use the bitmap index for the data manage-ment software [25].

Generally, a bitmap index consists of a set ofbitmaps and queries can be answered using bit-wise logical operations on the bitmaps. Figure 1 showsa set of such bitmaps for the attribute X of a tiny ta-ble (T) consisting of only eight tuples (rows). The at-tribute X can have one of four values, 0, 1, 2 and3. There are four bitmaps each corresponding toone of the four choices. For convenience, we have la-beled the four bit sequences b1, . . . , b4. To process thequery “select * from T where X < 2,” one per-forms the bitwise logical operation b1 OR b2. Since

LBNL-49627

select * from T where X < 2

Table T

Bit-wiseOR

Page 14: Overview on Hardware Optimizations for Database Engines

14

Bit-Wise OR on Compressed Bitmaps

40000380 00000000 00000000 001FFFFFb1

40000380 8000002 001FFFFF

Literal 0 fill Literal

7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000b2

WAHb1

C0000002 7C0001E0 3FE00000

1 fill Literal Literal

WAHb2

Bit-wise OR

32 bit wordsIn hex

OR OR OR OR

Logical operations (AND, OR, XOR) on two compressed bitmaps

1) Load WAH word(s)2) Calculate output (Fill-Fill,

Literal-Fill, Literal-Literal)3) Combine output

10<runlength>

11<runlength>

...

...

7FFFFFFF

00000000

Page 15: Overview on Hardware Optimizations for Database Engines

15

C-Code

 WHILE(XIDX!=XSIZE && YIDX!=YSIZE) {

//new X or Y? Calculate new fill count …

if(XisFill==1 && YisFill==1) { //2 fills

if(XfillWords<YfillWords)

min=XfillWords;

else

min=YfillWords;

writeFill(comprResultBI,&Zidx,X[Xidx]|Y[Yidx],min);

XfillWords-=min;

YfillWords-=min;

}

else if((XisFill==1 && YisFill==0) || (XisFill==0 && YisFill==1)) {

if(XisFill==1){

XfillWords--;

if((X[Xidx]&0xC0000000)==0xC0000000) writeFill(comprResultBI, &Zidx, 0xC0000000, 1);

else { comprResultBI[Zidx]=Y[Yidx]; Zidx++; }

}

if(YisFill==1){

YfillWords--;

if((Y[Yidx]&0xC0000000)==0xC0000000)

writeFill(comprResultBI, &Zidx, 0xC0000000, 1);

else {comprResultBI[Zidx]=X[Xidx]; Zidx++; }

}

}

else {

result=X[Xidx]|Y[Yidx];

if((result&0x7FFFFFFF)==0x7FFFFFFF) writeFill(comprResultBI, &Zidx, 0xC0000000, 1);

else if((result&0x7FFFFFFF)==0) writeFill(comprResultBI, &Zidx, 0x80000000, 1);

else { comprResultBI[Zidx]=X[Xidx]|Y[Yidx]; Zidx++; }

}

}

Fill-Fill

Literal-Fill

Literal-Literal

Page 16: Overview on Hardware Optimizations for Database Engines

16

Processing with PE Extension

Application specific states Preprocessing

Operation

Postprocessing

Application specific states

Initial Load Load Prepare Store Store

Memory 0

Memory 1

Memory 0

Memory 1

0000000F

00000003

40000380

80000002

001FFFFF

C0000002

7C0001E0

3FE00000

MEMORY

0

MEMORY

1

10000000..11000001..00101010..0111011.. 11000000..00101010..11000001..00110111..

10000000..11000001..00101010..01110111..

Is word fill or Literal? -> fill -> overwrite input words

11111111..11111111..11111111..11111111..

00000000..00000000..00000000..

11000000..00101010..11000001..0011011..00000000.. v 11111111... => 111111..

Write to output stream-> append or overwrite previous wordwith increased fill counter

00000000.0000000..00000..110011010..

Buffer result

11001110..

00000000..

00000000..

00000000..

MEMORY

0/1

Proceedtonext word(4x)

Align to 128-bit lines

Perform operation OR

ldXstream()

ldYstream()

4 x WAHinst()

Page 17: Overview on Hardware Optimizations for Database Engines

17

Bit-Wise OR on Compressed Bitmaps

40000380 00000000 00000000 001FFFFFb1

40000380 8000002 001FFFFF

Literal 0 fill Literal

7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000b2

WAHb1

C0000002 7C0001E0 3FE00000

1 fill Literal Literal

WAHb2

Bit-wise OR

32 bit wordsIn hex

OR OR OR OR

Code with Extension

do{ ldXstream();ldYstream();WAHinst(); WAHinst(); WAHinst();

} while(WAHinst());

Page 18: Overview on Hardware Optimizations for Database Engines

18

Many More Extensions

Bitmap Compression andProcessing (AND,

OR, XOR)

Hashing Sorted Set OperationsW

AH

PLW

AH

CO

MP

AX

Has

h+

Lo

oku

p

Has

h +

Inse

rt

Has

h K

eys

Has

h S

amp

ling

Cit

yHas

h3

2

Me

rge

Sort

Inte

rse

ctio

n

Un

ion

Dif

fere

nce

Sort

-Me

rge

Join

Sort

-Me

rge

Ag

gre

gat

ion

(SU

M)

BitiX X X X

HASHI X X X X X

Titan3D X X X X X X X

Tomahawk DBA

X X X X X X

Processor

Extension

Page 19: Overview on Hardware Optimizations for Database Engines

19

Evaluation

 REFERENCE PROCESSORS

§ Tomahawk DBA Processor --> Set of different DB-Extensions for WAH-Compression, Hashing, andSortes-Set Operations

Processor DescriptionTechnology

[nm]Atotal [mm²] fMAX [GHz]

PMAX [W] @ fMAX

Tomahawkwithout DBA

Basic Xtensa LX5 without instruction set extensions, 1 LSU, 32-bit memory interface

28 15.92 0.555 0.7

Tomahawk with DBA

Set of different DB-Extensions for WAH-Compression, Hashing and Sorted-Set Operations

28 18 0.5 0.753

Intel i7-6500ULow-power Intel 2-core processor based on Skylake architecture, 4MB L3 cache

14 99* 3.1 25

Comparison

Page 20: Overview on Hardware Optimizations for Database Engines

20

Evaluation - Bitmaps

Page 21: Overview on Hardware Optimizations for Database Engines

21

Outline

 PART 2:

 INTELLIGENT DMA CONTROLLER

Page 22: Overview on Hardware Optimizations for Database Engines

22

Problem Statement

22

NoC

T2RISCCore

T2RISCCore

T2RISCCore

APP CM

Memory

MicronDDR2SDRAM

LocalMemoryLocalMemoryLocalMemory

MemoryControllerSynopsysDWC

DDR2

APPTensilica570T

CMLX4-ISA_E

LocalMemoryCache

tAN

tNMc

0 0xCCA

1 0x00B

2 0x0FA

3 0x1FD

4 0xDE1

5 0x0ED

6 0x00E

7 0xD0A

tNAtMcN

tMcM

tMMc

tAPP

Problem:Many round-trips for key lookups

Approach:“Teach B-trees to the memory controller“

Page 23: Overview on Hardware Optimizations for Database Engines

23

Intelligent Main Memory Controller (iDMA)

23

NoC

Core Core Core

PointerChaser

APP CM

Memory

MicronDDR2SDRAM

LocalMemoryLocalMemoryLocalMemory

MemoryController

APP CM

MemoryControllerSynopsis

LocalMemoryCache

MemoryControllerSynopsysDWC

DDR2

0 0xCC6

1 0x000

2 0x0F0

3 0x1FD

4 0xDE1

5 0x0ED

6 0x00E

7 0xD0A

tCNtNC tMcM

tMMc

tNP

tPN

tPMc

tMcP

Vision (and first simulations)• Intelligent memory controller• Is aware of the semantics of

memory layout• Implements core operations (e.g. lookup)

Implementation (no yet in silicon)• 0,183mm² PE with 200Mhz

Page 24: Overview on Hardware Optimizations for Database Engines

24

First iDMA Design

Page 25: Overview on Hardware Optimizations for Database Engines

25

Evaluation using Simulator

Page 26: Overview on Hardware Optimizations for Database Engines

26

Summary

 HARDWARE FOUNDATION

 EXTENSIONS FOR PROCESSING ELEMENTS

 INTELLIGENT DMA CONTROLLER

Page 27: Overview on Hardware Optimizations for Database Engines

Overview on Hardware Optimizations for Database EnginesAnnett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner

BTW 2017, Stuttgart, Germany, 2017-03-09