Overview on Hardware Optimizations for Database Engines
Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner
BTW 2017, Stuttgart, Germany, 2017-03-09
2
Interaction DB-Engine and Hardware
Applications/Database Engines
Modern Hardware
Well-Known Challenge: Exploit hardware technology by specific data management techniques (indexing, data storage, query & transaction processing)
[Figure, two panels over the years 1970 to 2020. Main Memory: memory (KByte), log scale from 10 to 1e+07. CPU: #cores, 0 to 10.]
3
Era of Dark Silicon
MOORE'S LAW
§ Number of transistors in a dense integrated circuit doubles approximately every two years.
DARK SILICON
§ We can no longer power the transistors that Moore is giving us
[Figure: #transistors (x1000, log scale) and process (nm) over the years 1970 to 2020]
http://engineering.nyu.edu/garg/node/31
4
HW/SW Co-Design for DB-Engines
Applications/Database Engines
Modern Hardware
Challenge: HW/SW Co-Design for Database Engines
Specialization of Hardware to overcome Dark Silicon
5
Outline
HARDWARE FOUNDATION
EXTENSIONS FOR PROCESSING ELEMENTS
INTELLIGENT DMA CONTROLLER
6
Hardware Foundation
TOMAHAWK PLATFORM
7
Hardware Foundation – Zoom In
8
Hardware Foundation – Zoom In (2)
CORE MANAGER (CM)
§ Extended Xtensa-LX5 from Tensilica (now Cadence)
§ 32KB for code
§ 64KB for data
PROCESSING ELEMENTS (PE)
§ Xtensa-LX5 from Tensilica (now Cadence)
§ 32KB for code
§ 2x32KB for data on PE
APPLICATION CORE (APP)
§ 570T core from Tensilica (now Cadence)
Control-Plane
9
Outline
Control-Plane
PART I:
EXTENSIONS OF PROCESSING ELEMENTS
10
Development Flow
DEVELOPMENT OF INSTRUCTION SET EXTENSIONS WITH TENSILICA TOOLS
§ Tensilica Instruction Extension (TIE) language
§ C/TIE compiler
§ Cycle accurate simulator/debugger
§ Processor generator
SYNTHESIS OF RTL CODE
§ Synopsys Design Compiler, PrimeTime PX
§ TSMC CMOS LP 65nm libraries
int res = (v0 + v1 + v2) >> shift8;
// shift8 -> internal state
int res = add3_shift(v0, v1, v2);
11
Investigated Database Primitives
Primitives:
§ Bitmap Compression and Processing (AND, OR, XOR): WAH, PLWAH, COMPAX
§ Hashing: Hash + Lookup, Hash + Insert, Hash Keys, Hash Sampling, CityHash32
§ Sorted Set Operations: Merge Sort, Intersection, Union, Difference, Sort-Merge Join, Sort-Merge Aggregation (SUM)
2014
12
Basic RISC Instruction Set
Application-Specific Instruction Set
Instruction Set
Application-Specific States
Application-Specific Registers
Basic Registers
Register Files
Instruction fetch
Load-Store Unit 0
Load-Store Unit 1
Data Prefetcher
Interconnect
Local Instruction Memory
Local Data Memory 0
Local Data Memory 1
Extended Tensilica LX5 Processor
64 bit
128 bit
128 bit
General Approach for all Extensions
13
Bitmap Primitives
BITMAPS ARE A SPECIAL KIND OF INDEX
§ bit length equals number of tuples
BITMAP COMPRESSION
WORD-ALIGNED HYBRID (WAH) CODE
§ Stateless compression
§ Run-length-encoding (RLE)
  - runs of 0's and 1's
§ WAH bitmaps contain
  - RLE-compressed fills and
  - uncompressed literals
[Excerpt shown on the slide: first page of the WAH paper]

Compressing Bitmap Indexes for Faster Search Operations
Kesheng Wu, Ekow J. Otoo and Arie Shoshani
Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
Email: {kwu, ejotoo, ashoshani}@lbl.gov
LBNL-49627

Abstract

In this paper, we study the effects of compression on bitmap indexes. The main operations on the bitmaps during query processing are bitwise logical operations such as AND, OR, NOT, etc. Using the general purpose compression schemes, such as gzip, the logical operations on the compressed bitmaps are much slower than on the uncompressed bitmaps. Specialized compression schemes, like the byte-aligned bitmap code (BBC), are usually faster in performing logical operations than the general purpose schemes, but in many cases they are still orders of magnitude slower than the uncompressed scheme. To make the compressed bitmap indexes operate more efficiently, we designed a CPU-friendly scheme which we refer to as the word-aligned hybrid code (WAH). Tests on both synthetic and real application data show that the new scheme significantly outperforms well-known compression schemes at a modest increase in storage space. Compared to BBC, a scheme well-known for its operational efficiency, WAH performs logical operations about 12 times faster and uses only 60% more space. Compared to the uncompressed scheme, in most test cases WAH is faster while still using less space. We further verified with additional tests that the improvement in logical operation speed translates to similar improvement in query processing speed.

1. Introduction

This research was originally motivated by the need to manage the volume of data produced by a high-energy experiment called STAR¹ [25, 26]. In this experiment, information about each potentially interesting collision event is recorded and multi-terabyte (10^12) of data is generated each year. One important way of accessing the data is to have the data management system retrieve the events satisfying some condition such as "Energy > 15 GeV and 7 <= NumParticles < 13" [5, 25]. The physicists have identified about 500 attributes that are useful for this selection process and a typical condition may involve a handful of attributes. This type of queries are known as the partial range queries. Since the attributes are usually read not modified, the characteristics of the dataset are very similar to those of commercial data warehouses. In data warehouse applications, one of the best known indexing strategies for processing the partial range queries is the bitmap index [6, 8, 21, 30]. For this reason, we have selected to use the bitmap index for the data management software [25].

¹ Information about the project is also available at http://www.star.bnl.gov/STAR.

Generally, a bitmap index consists of a set of bitmaps and queries can be answered using bitwise logical operations on the bitmaps. Figure 1 shows a set of such bitmaps for the attribute X of a tiny table (T) consisting of only eight tuples (rows). The attribute X can have one of four values, 0, 1, 2 and 3. There are four bitmaps each corresponding to one of the four choices. For convenience, we have labeled the four bit sequences b1, . . . , b4. To process the query "select * from T where X < 2," one performs the bitwise logical operation b1 OR b2. Since [...]

Figure 1. A sample bitmap index.

OID | X | =0 | =1 | =2 | =3
1   | 0 | 1  | 0  | 0  | 0
2   | 1 | 0  | 1  | 0  | 0
3   | 3 | 0  | 0  | 0  | 1
4   | 2 | 0  | 0  | 1  | 0
5   | 3 | 0  | 0  | 0  | 1
6   | 3 | 0  | 0  | 0  | 1
7   | 1 | 0  | 1  | 0  | 0
8   | 3 | 0  | 0  | 0  | 1
    |   | b1 | b2 | b3 | b4
select * from T where X < 2
Table T
Bit-wise OR
14
Bit-Wise OR on Compressed Bitmaps
Bit-wise OR on two bitmaps (32-bit words, in hex)

b1:     40000380 00000000 00000000 001FFFFF
WAH b1: 40000380 80000002 001FFFFF
        Literal  0 fill   Literal

b2:     7FFFFFFF 7FFFFFFF 7C0001E0 3FE00000
WAH b2: C0000002 7C0001E0 3FE00000
        1 fill   Literal  Literal

Fill words:
§ 10<runlength>: 0 fill, expands to 00000000 ...
§ 11<runlength>: 1 fill, expands to 7FFFFFFF ...

Logical operations (AND, OR, XOR) on two compressed bitmaps
1) Load WAH word(s)
2) Calculate output (Fill-Fill, Literal-Fill, Literal-Literal)
3) Combine output
15
C-Code
while (Xidx != Xsize && Yidx != Ysize) {
    // new X or Y? Calculate new fill count …

    if (XisFill == 1 && YisFill == 1) {
        // Fill-Fill: emit the shorter run, keep the rest of the longer one
        if (XfillWords < YfillWords)
            min = XfillWords;
        else
            min = YfillWords;
        writeFill(comprResultBI, &Zidx, X[Xidx] | Y[Yidx], min);
        XfillWords -= min;
        YfillWords -= min;
    }
    else if ((XisFill == 1 && YisFill == 0) || (XisFill == 0 && YisFill == 1)) {
        // Literal-Fill: a 1 fill dominates the OR, a 0 fill passes the literal through
        if (XisFill == 1) {
            XfillWords--;
            if ((X[Xidx] & 0xC0000000) == 0xC0000000)
                writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
            else { comprResultBI[Zidx] = Y[Yidx]; Zidx++; }
        }
        if (YisFill == 1) {
            YfillWords--;
            if ((Y[Yidx] & 0xC0000000) == 0xC0000000)
                writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
            else { comprResultBI[Zidx] = X[Xidx]; Zidx++; }
        }
    }
    else {
        // Literal-Literal: OR both words; all ones or all zeros becomes a fill
        result = X[Xidx] | Y[Yidx];
        if ((result & 0x7FFFFFFF) == 0x7FFFFFFF)
            writeFill(comprResultBI, &Zidx, 0xC0000000, 1);
        else if ((result & 0x7FFFFFFF) == 0)
            writeFill(comprResultBI, &Zidx, 0x80000000, 1);
        else { comprResultBI[Zidx] = result; Zidx++; }
    }
}
16
Processing with PE Extension
[Figure: data flow of the WAH operation through the extended PE]
Stages: Initial Load, Load, Preprocessing (Prepare), Operation, Postprocessing, Store
Instructions: ldXstream(), ldYstream(), 4 x WAHinst()
Storage: Memory 0, Memory 1, application-specific states
Example words in memory (hex): 0000000F 00000003 40000380 80000002 001FFFFF C0000002 7C0001E0 3FE00000
Annotations:
§ Align to 128-bit lines
§ Is word fill or literal? -> fill -> overwrite input words
§ Perform operation OR
§ Write to output stream -> append or overwrite previous word with increased fill counter
§ Buffer result
§ Proceed to next word (4x)
17
Bit-Wise OR on Compressed Bitmaps
Code with Extension
do {
    ldXstream();
    ldYstream();
    WAHinst(); WAHinst(); WAHinst();
} while (WAHinst());
18
Many More Extensions
Primitives (columns):
§ Bitmap Compression and Processing (AND, OR, XOR): WAH, PLWAH, COMPAX
§ Hashing: Hash + Lookup, Hash + Insert, Hash Keys, Hash Sampling, CityHash32
§ Sorted Set Operations: Merge Sort, Intersection, Union, Difference, Sort-Merge Join, Sort-Merge Aggregation (SUM)

Processor Extension (rows), one X per supported primitive:
BitiX          X X X
HASHI          X X X X X
Titan3D        X X X X X X X
Tomahawk DBA   X X X X X X
19
Evaluation
REFERENCE PROCESSORS
§ Tomahawk DBA Processor --> Set of different DB-Extensions for WAH-Compression, Hashing, and Sorted-Set Operations
Processor | Description | Technology [nm] | Atotal [mm²] | fMAX [GHz] | PMAX [W] @ fMAX
Tomahawk without DBA | Basic Xtensa LX5 without instruction set extensions, 1 LSU, 32-bit memory interface | 28 | 15.92 | 0.555 | 0.7
Tomahawk with DBA | Set of different DB-Extensions for WAH-Compression, Hashing and Sorted-Set Operations | 28 | 18 | 0.5 | 0.753
Intel i7-6500U | Low-power Intel 2-core processor based on Skylake architecture, 4MB L3 cache | 14 | 99* | 3.1 | 25
Comparison
20
Evaluation - Bitmaps
21
Outline
PART II:
INTELLIGENT DMA CONTROLLER
22
Problem Statement
[Figure: Tomahawk system overview with the access path of a key lookup]
Components:
§ NoC
§ Three T2 RISC cores, each with local memory
§ APP (Tensilica 570T) with local memory/cache
§ CM (LX4-ISA_E)
§ Memory controller (Synopsys DWC DDR2)
§ Memory (Micron DDR2 SDRAM)
Latencies along the path: tAPP, tAN, tNA, tNMc, tMcN, tMcM, tMMc
Example memory contents (address: value):
0: 0xCCA
1: 0x00B
2: 0x0FA
3: 0x1FD
4: 0xDE1
5: 0x0ED
6: 0x00E
7: 0xD0A
Problem: Many round-trips for key lookups
Approach: "Teach B-trees to the memory controller"
23
Intelligent Main Memory Controller (iDMA)
[Figure: system overview extended with a pointer chaser]
Components:
§ NoC
§ Three cores, each with local memory
§ Pointer chaser
§ APP with local memory/cache
§ CM
§ Memory controller (Synopsys DWC DDR2)
§ Memory (Micron DDR2 SDRAM)
Latencies along the path: tCN, tNC, tNP, tPN, tPMc, tMcP, tMcM, tMMc
Example memory contents (address: value):
0: 0xCC6
1: 0x000
2: 0x0F0
3: 0x1FD
4: 0xDE1
5: 0x0ED
6: 0x00E
7: 0xD0A
Vision (and first simulations)
• Intelligent memory controller
• Is aware of the semantics of memory layout
• Implements core operations (e.g. lookup)
Implementation (not yet in silicon)
• 0.183 mm² PE with 200 MHz
24
First iDMA Design
25
Evaluation using Simulator
26
Summary
HARDWARE FOUNDATION
EXTENSIONS FOR PROCESSING ELEMENTS
INTELLIGENT DMA CONTROLLER