Dutch-Belgian Database Day, University of Antwerp, 2004.12.03
MonetDB/x100
Peter Boncz, Marcin Zukowski, Niels Nes
Introduction
What is x100?
A new query processing engine developed for MonetDB
Contents
Introduction
  CWI Database Group
  Motivation
MonetDB/x100 architecture highlights
  Optimizing CPU performance
  Exploiting cache memories
  Enhancing disk bandwidth
Conclusions
Discussion
CWI Database Group
Database architecture
  DBMS design, implementation, evaluation
  Wide area with many sub-areas:
    data structures
    query processing algorithms
    modern computer architectures
MonetDB
  1994-2004 at CWI
  open-source high-performance DBMS
  future: X100, MonetDB 5.0
Motivation
Multimedia retrieval
  TREC Video: 130 hours of news, growing each year
  Task: search for a given text (via speech recognition) or for video similar to a given image
  3 TB of data (!)
Motivation
Similar areas:
  data mining
  OLAP, data warehousing
  scientific applications (astronomy, biology…)
Challenge: process really large datasets efficiently within a DBMS
x100 Highlights
Use computer architecture to guide this talk
CPU
Actual data processing
CPU
From CISC to hyper-pipelined:
  1986: 8086: CISC
  1990: 486: 2 execution units
  1992: Pentium: 2 x 5-stage pipelined units
  1996: Pentium3: 3 x 7-stage pipelined units
  2000: Pentium4: 12 x 20-stage pipelined execution units
Each instruction executes in multiple steps… A -> A1, …, An
… in (multiple) pipelines:
CPU
But only if the instructions are independent! Otherwise the pipeline stalls.
Problems:
  branches in program logic
  accessing recently modified memory
[ailamaki99, …]: DBMSs are bad at filling pipelines
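Where the slide warns about branches, a common remedy in vectorized engines is to make selection primitives branch-free. A minimal sketch of the idea (function names and code are illustrative, not actual X100 source):

```c
#include <stddef.h>
#include <stdint.h>

/* Both variants fill sel[] with the positions i where v[i] < bound.
 * The first takes a data-dependent branch per tuple; on unpredictable
 * data, each misprediction flushes the pipeline. Note: sel[] must have
 * room for n entries, since the branch-free variant always stores. */
size_t select_lt_branch(size_t n, const int32_t *v, int32_t bound,
                        size_t *sel)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] < bound)          /* unpredictable branch on data */
            sel[k++] = i;
    return k;
}

/* The second replaces the branch with arithmetic on the comparison
 * result: always store the position, conditionally advance the count. */
size_t select_lt_branchfree(size_t n, const int32_t *v, int32_t bound,
                            size_t *sel)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = i;
        k += (v[i] < bound);       /* no jump, so no misprediction */
    }
    return k;
}
```

Both produce the same selection vector; the branch-free form trades a few extra stores for a pipeline that never flushes.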
x100: vectorized processing
*(int, int) : int        (tuple-at-a-time)
*(int[], int[]) : int[]  (vector-at-a-time)
x100: vectorized processing
Primitives: vector-at-a-time
  very basic functionality
  independent loop iterations
  simple code
Optimization levels:
  compiler: loop pipelining
  CPU: full pipelines
*(int, int) : int        (tuple-at-a-time)
*(int[], int[]) : int[]  (vector-at-a-time)
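The vector-at-a-time signature *(int[], int[]) : int[] might look as follows in C (a sketch of the idea only; the name is mine, not the X100 source):

```c
#include <stddef.h>

/* Instead of interpreting *(int,int):int once per tuple, the engine
 * calls one primitive per vector. The loop body has no dependencies
 * between iterations, so the compiler can software-pipeline it and the
 * CPU can keep its execution-unit pipelines full. */
void map_mul_int_vec(size_t n, const int *a, const int *b, int *res)
{
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] * b[i];   /* independent iterations: pipelinable */
}
```

Interpretation overhead (function-call and dispatch cost) is paid once per vector instead of once per tuple, which is how the engine gets down to a few CPU cycles per tuple.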
x100: results (TPC-H Q1)
Few CPU cycles per tuple
  e.g., MySQL spends ~100 cycles per tuple on such operators
Main memory
Large, but not unlimited
Cache
Faster, but very limited storage
Cache Memory Bottleneck
Cache hides memory-access cost
Different costs at different levels:
  L1 cache access: 1-2 cycles
  L2 cache access: 6-20 cycles
  main-memory access: 100-400 cycles
Consequences:
  random access into main memory is very expensive
  the DBMS must buffer for the CPU cache, not for RAM
Cache-conscious query processing: MonetDB research [VLDB '99, '00, '02, '04]
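The cost gap between sequential and random memory access can be seen with a toy comparison (illustrative only, not MonetDB code):

```c
#include <stddef.h>
#include <stdint.h>

/* Both functions compute the same sum. The sequential scan streams
 * through cache lines, incurring roughly one miss per 16 int32s; the
 * permuted traversal, once the array exceeds the caches, misses on
 * nearly every access at 100-400 cycles each. Same work, very
 * different memory behavior. */
int64_t sum_sequential(size_t n, const int32_t *v)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];               /* sequential: cache-line friendly */
    return s;
}

int64_t sum_permuted(size_t n, const int32_t *v, const size_t *perm)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[perm[i]];         /* random access: ~1 miss per element */
    return s;
}
```

This is the motivation for buffering for the cache rather than for RAM: algorithms are restructured so the inner loops touch cache-resident data only.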
x100: pipelining
Vectors fill the CPU cache
main-memory access only at the data sources and sinks
[Diagram: an expression tree (Project over *, +, - with the constant 0.19) executes inside the X100 query processor, which operates in the CPU cache; the X100 buffer manager sits in RAM, backed by disk]
MonetDB uses much more main-memory bandwidth
Disk
Slow, but unlimited (?) storage
Disk
Random access: hopeless
Size grows faster than bandwidth
x100: problem - bandwidth
MonetDB/x100 is too fast for disks
TPC-H queries need 200-600 MB/s
Bandwidth improvements
Three ideas:
  vertical fragmentation (MonetDB)
  new: lightweight compression
  new: cooperative scans
Vertical fragmentation
DBMS disk access in data-intensive applications
With vertical fragmentation, only the relevant columns are read – reduced disk-bandwidth requirements
Lightweight Compression
Compression is introduced not to reduce storage space but to increase disk bandwidth:
  thanks to efficient code, processing disk-based data uses only a few percent of CPU time
  part of this spare CPU time can be spent on decompressing data
Lightweight Compression
Rationale:
  disk-to-RAM transfer uses DMA and does not need the CPU
  (de)compress only a vector at a time, when the data is needed
[Diagram: the same architecture picture as before; (de)compression happens where data crosses from RAM into the CPU cache]
Compress on the CPU cache / RAM boundary
Lightweight Compression
Standard compression won't do: it compresses too well => too slow (~100 MB/s)
Research question: devise lightweight (de)compression algorithms
Results so far:
  compression factor relatively small, up to 3.5
  decompression speed: 3 GB/s (!)
  compression speed: 1 GB/s (!!!)
  perceived bandwidth 3 times bigger
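One family of schemes that reaches GB/s decompression is frame-of-reference coding, where each value in a vector is stored as a small offset from a per-vector base. A minimal sketch (illustrative only; the actual X100 algorithms are more elaborate):

```c
#include <stddef.h>
#include <stdint.h>

/* Frame-of-reference (FOR) sketch: values near a common base are
 * stored as 8-bit offsets instead of 32-bit ints (4x smaller on disk).
 * This toy encoder assumes every offset fits in 8 bits; a production
 * scheme must handle exceptions that do not fit. */
void for_encode(size_t n, const int32_t *in, int32_t base, uint8_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(in[i] - base);  /* assumes 0 <= diff <= 255 */
}

/* Decoding is a trivial, branch-free, independent-iteration loop, so
 * it pipelines just like any other vectorized primitive. */
void for_decode(size_t n, const uint8_t *in, int32_t base, int32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = base + in[i];             /* independent iterations */
}
```

Because decoding happens vector-at-a-time on the RAM/cache boundary, the decompressed data never round-trips through main memory.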
Cooperative Scans
Idea: use I/O bandwidth to satisfy multiple queries
Active Buffer Manager: aware of concurrent scans on the same table
Research question: devise adaptive buffer-management strategies
Benefits:
  I/O bandwidth is reused by multiple queries
  concurrent queries no longer fight for the disk arm
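A toy version of such a scheduling policy (purely illustrative, not the actual Active Buffer Manager): among table chunks not yet buffered, load the one wanted by the most concurrent scans, so a single disk read serves many queries.

```c
#include <stddef.h>

/* needed[q * nchunks + c] is nonzero if query q still needs chunk c;
 * cached[c] is nonzero if chunk c is already in the buffer pool.
 * Returns the uncached chunk with the highest demand, or -1 if no
 * query needs anything. An adaptive policy would also weigh query
 * progress and chunk reuse, but the greedy core looks like this. */
int pick_next_chunk(size_t nqueries, size_t nchunks,
                    const int *needed, const int *cached)
{
    int best = -1;
    size_t best_demand = 0;
    for (size_t c = 0; c < nchunks; c++) {
        if (cached[c])
            continue;                     /* already buffered: free */
        size_t demand = 0;
        for (size_t q = 0; q < nqueries; q++)
            demand += needed[q * nchunks + c] ? 1 : 0;
        if (demand > best_demand) {
            best_demand = demand;
            best = (int)c;
        }
    }
    return best;
}
```

Since scans no longer insist on sequential order, each disk read can be delivered to every scan that wants that chunk, which is why throughput holds up under many concurrent queries.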
Cooperative Scans
x100 with cooperative scans: >30 concurrent queries without performance degradation
x100 summary
The original MonetDB is successful in the same application areas, however:
  sub-optimal CPU utilization
  only efficient if the problem fits in RAM
x100 improves the architecture on all levels:
  better CPU utilization
  better cache utilization
  scales to non-memory-resident datasets
  improves I/O bandwidth using compression and cooperative scans
Example results
Performance close to hand-written C functions
TPC-H SF-1   x100    Oracle   MonetDB
Q1           0.54s   30s      9.4s
Q3           0.24s   10s      2.5s
Q6           0.15s   1.5s     2.5s
Q14          0.13s   2s       1.2s
x100 status
First proof-of-concept implemented
Full TPC-H benchmark executes
Future work:
  lots of engineering
  new buffer manager
  more vectorized algorithms
  memory-footprint tuning (for small devices)
  SQL front-end
More information
www.cwi.nl/~boncz/x100.html
CIDR'05 paper: "MonetDB/X100: Hyper-Pipelining Query Execution"
Discussion
?