
Building PetaByte Warehouses with Unmodified PostgreSQL

Emmanuel Cecchet

Member of Research Staff

May 21st, 2009

Topics

Introduction to frontline data warehouses

PetaByte warehouse design with PostgreSQL

Aster contributions to PostgreSQL

Q&A

2 PGCon 2009, Ottawa © 2009 Aster Data Systems


Enterprise Data Warehouse Under Stress

Enterprise Data Warehouse

Offloading The Enterprise Data Warehouse

Enterprise Data Warehouse

Frontline Data Warehouse
• Petabytes source data
• 24x7 availability
• TB/day load capacity
• In-database transforms
• Rapid data access

Archival Data Warehouse
• Petabytes detailed data
• Low cost/TB
• Flexible compression
• On-line access
• Aging out to off-line

Requirements for Frontline Data Warehouses

[Slide body illegible in this transcript]

Who is Aster Data Systems?

Aster nCluster is a software-only RDBMS for large-scale frontline data warehousing

High performance: “Always Parallel” MPP architecture

High availability: “Always On” on-line operations

High value analytics: In-Database MapReduce

Low cost: Petabytes on commodity HW

MySpace Frontline Data Warehouse

[Slide body illegible in this transcript]

Topics

Introduction to frontline data warehouses

PetaByte warehouse design with PostgreSQL

Aster contributions to PostgreSQL

Q&A


Petabyte Data Warehouse Design

PostgreSQL as a building block
• Do not hack PostgreSQL to serve the distributed database
• Build on top of mainline PostgreSQL
• Use standard Postgres APIs

Service Oriented Architecture
• Hierarchical query planning and optimization
• Shared-nothing replication treating Postgres and Linux as a service
• Compression at the OS level, transparent to PostgreSQL
• In-database MapReduce, out-of-process

Aster nCluster Database

[Architecture diagram. Labels: Queries/Answers, Queen Server Group, Queries, Data, Worker Server Group, Loader/Exporter Server Group]

Queen Node: manages system, configurations, [...] and error handling; coordinates queries

Worker Node: [description illegible in this transcript]

Loader Node: [description illegible in this transcript]

Query Processing: How It Works

A: Users send queries to the Queen (via ACT, ODBC/JDBC, BI tools, etc.)

B: The Queen parses the query, creates an optimal distributed plan, sends subqueries to the Workers, and supervises the processing

C: Workers execute their subqueries (locally or via distributed communication)

D: The Queen aggregates Worker results and returns the final result to the user
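Steps A to D amount to a scatter/gather pattern. The sketch below is only an illustration of that pattern, not Aster's implementation: the in-memory `WORKER_PARTITIONS` data, the `worker_subquery` and `run_query` names, and the use of threads in place of networked Workers are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held by one Worker.
WORKER_PARTITIONS = [
    [("fr", 3), ("us", 5)],
    [("us", 2), ("de", 7)],
    [("fr", 1), ("de", 1)],
]


def worker_subquery(rows):
    """Step C: a Worker computes a partial SUM(value) GROUP BY key locally."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial


def run_query(partitions):
    """Steps B and D: the Queen fans out subqueries, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(worker_subquery, partitions))
    final = {}
    for partial in partials:
        for key, value in partial.items():
            final[key] = final.get(key, 0) + value
    return final
```

Calling `run_query(WORKER_PARTITIONS)` merges the three partial aggregates into one result; the Queen only ever sees one small dictionary per Worker, not the raw rows.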


Table Compression Architecture

How It Works
• Data is compressed and decompressed below the logical database
• Stored in different tablespaces depending on compression level

Architecture Benefits
• High compression ratios (bigger block sizes yield more cost savings)
• Database transparent (ease of future Postgres upgrades)
• Concurrency (high-performance multi-table compression)
• Performance: “[F]ew-row” queries don't need full table decompression

[Figure: a logical database (e.g. schema) and a query mapped onto physical tablespaces labeled Not Compressed, Compress-Low, Compress-Medium, and Compress-High]

Table Compression Enables Powerful Archival

Compressed - High

Compressed - Medium

Compressed - Low

Not Compressed

Older data accessed less frequently

Compress to save space and cost

Oldest data is compressed the most, recent is compressed the least

Compressed tables are fully available for queries (true online archival)
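The age-based tiering described above can be pictured as a simple lookup from partition age to compression level. This is a sketch under stated assumptions: the thresholds in `TIERS`, the tier names, and the `tier_for` helper are hypothetical, since the slides give no numbers or API.

```python
from datetime import date

# Hypothetical age thresholds (in days) and tier names; the slides give no numbers.
TIERS = [
    (30, "not_compressed"),
    (90, "compressed_low"),
    (365, "compressed_medium"),
]
OLDEST_TIER = "compressed_high"


def tier_for(partition_date, today):
    """Return the tier for a partition: the older the data, the stronger the compression."""
    age_days = (today - partition_date).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return OLDEST_TIER
```

A periodic job could call `tier_for` for each partition and move it to the matching tablespace; the data stays queryable at every tier, which is the point of online archival.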


What is “MapReduce” and Why Should I Care?

What is MapReduce?

Popularized by Google

http://labs.google.com/papers/mapreduce.html

Processes data in parallel across distributed cluster

Why is MapReduce significant?

Empowers ordinary developers

Write application logic, not debug cluster communication code

Why is In-Database MapReduce significant?

Unites MapReduce with SQL: power invoked from SQL

Develop SQL/MR functions with common languages
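To make the programming model concrete, here is a minimal single-process word count in the MapReduce style. It is not SQL/MR and not Aster's API; the function names `map_phase`, `reduce_phase`, and `map_reduce` are chosen for this example only. The developer writes the two small functions, while grouping by key (and, on a real cluster, distribution) is the framework's job.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (word, 1) for each word in one input line."""
    for word in record.split():
        yield word.lower(), 1


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def map_reduce(records):
    """Run map, group by key (the 'shuffle'), then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

In a distributed setting the same `map_phase` and `reduce_phase` would run on many nodes in parallel, which is why the application logic never touches cluster communication code.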


Aster In-Database MapReduce

[Diagram layers: Users, analysts, applications; SQL/MR Functions; In-Database MapReduce; Data Store Engine (Aster nCluster)]

In-Database MapReduce

Extensible framework (MapReduce + SQL)

• Flexible: Map-Reduce expressiveness, languages, polymorphism

• Performance: Massively parallel, computational push-down

• Availability: Fault isolation, resource management

Out-of-process executables

• Does not use PL/* for custom code execution

• Can execute Map and Reduce functions in any language that has a runtime on Linux (e.g. Java, Python, C#, C/C++, Ruby, Perl, R, etc.)

• Standard PostgreSQL APIs to send/receive data to executables

• Fault isolation, security and resource management for arbitrary user code
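The out-of-process idea can be sketched with an ordinary child process that receives rows on stdin and returns rows on stdout. This is an assumption-laden illustration, not the nCluster mechanism: the tab-separated row format, the `USER_CODE` stand-in, and the `run_out_of_process` helper are all hypothetical.

```python
import subprocess
import sys

# Stand-in for user code: a separate program that reads tab-separated rows on
# stdin and writes (key, doubled value) rows to stdout.
USER_CODE = r"""
import sys
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    print(f"{key}\t{int(value) * 2}")
"""


def run_out_of_process(rows, timeout_s=10):
    """Stream rows to a child process and collect its output rows.

    A crash or hang in the child surfaces as an exception in the caller
    instead of taking down the calling process.
    """
    payload = "".join(f"{key}\t{value}\n" for key, value in rows)
    result = subprocess.run(
        [sys.executable, "-c", USER_CODE],
        input=payload,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]
```

Because the user code lives in its own process, the caller can bound its run time with `timeout_s` and treat a non-zero exit as an ordinary error, which is the fault-isolation property the slide emphasizes.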


Always On: Minimize Planned Downtime

[Diagram labels: Live Queries; Backup & Restore; Data Backup; Add Capacity; Rebalance Data; Load & Export]

Precision Scaling

When more CPU/memory/capacity are needed, new nodes can be added for scale-out.

Precision Scaling uses standard PostgreSQL APIs to migrate vWorker partitions to new nodes either for load balancing (more compute power) or capacity expansion

Example: Assume Workers 1/2/3 are 100% CPU-bottlenecked. Incorporation adds a new Worker4 node and migrates over vWorker partitions D/H/L. As a result of load-balancing, CPU-utilization drops to 75% per node, eliminating hotspots.
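The arithmetic in the example can be checked with a small sketch. It assumes twelve vWorker partitions A to L spread evenly over Workers 1 to 3, each contributing 25% CPU; that layout, the `BEFORE` mapping, and the `rebalance` and `cpu_utilization` helpers are illustrative assumptions, not the product's migration logic.

```python
# Assumed starting layout: 12 vWorkers, 4 per node (only D/H/L moving is from the slide).
BEFORE = {
    "Worker1": ["A", "B", "C", "D"],
    "Worker2": ["E", "F", "G", "H"],
    "Worker3": ["I", "J", "K", "L"],
}


def rebalance(assignment, new_node):
    """Move one vWorker at a time from the fullest node to new_node until balanced."""
    result = {node: list(parts) for node, parts in assignment.items()}
    result[new_node] = []
    total = sum(len(parts) for parts in result.values())
    target = total // len(result)
    while len(result[new_node]) < target:
        donor = max(
            (node for node in result if node != new_node),
            key=lambda node: len(result[node]),
        )
        result[new_node].append(result[donor].pop())
    return result


def cpu_utilization(assignment, load_per_vworker):
    """CPU percentage per node, assuming every vWorker costs the same."""
    return {node: len(parts) * load_per_vworker for node, parts in assignment.items()}
```

With 25% per vWorker, the three original nodes start at 100% each; after `rebalance(BEFORE, "Worker4")` every node holds three vWorkers and sits at 75%, matching the figure on the slide.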


Replication

• Each partition (vWorker) is replicated at least once on another node
• In the event of node failure, replication protects against data loss and system outages
• Transparent to Postgres
• Mix of SQL and data shipping
• Linux [...] facilities to capture data changes

Fault Tolerance & Automatic Online Failover

Replication Failover

Automatic, non-disruptive, graceful performance impact

Replication Restoration

Delta-based (fast) and online (non-disruptive)
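A toy model of replica placement and failover may help fix the idea. It assumes a ring-style placement in which each vWorker's replica lives on the next node; the slides do not say how replicas are actually placed, and `place_replicas` and `failover` are hypothetical names.

```python
def place_replicas(primaries):
    """Assign each vWorker a replica on the next node in ring order.

    primaries: mapping of vWorker name -> node holding the primary copy.
    """
    nodes = sorted(set(primaries.values()))
    if len(nodes) < 2:
        raise ValueError("replication needs at least two nodes")
    placement = {}
    for vworker, node in primaries.items():
        placement[vworker] = nodes[(nodes.index(node) + 1) % len(nodes)]
    return placement


def failover(primaries, replicas, failed_node):
    """Return the node serving each vWorker after failed_node goes down."""
    return {
        vworker: (replicas[vworker] if node == failed_node else node)
        for vworker, node in primaries.items()
    }
```

Only the vWorkers whose primary sat on the failed node change location, so the rest of the cluster keeps serving queries while the lost replicas are rebuilt.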


Using Commodity Hardware

[Slide body illegible in this transcript]

Scaling On-Demand to a PetaByte

Commodity Hardware

2 TB Building Block

Dell, HP, IBM, Intel x86

16 GB Memory

2.4 TB of Storage

8 Disks

$5k to $10k per Node

More Blocks = More Power

Massive Power Per Rack

160 Cores

640 GB RAM

48 TB SAS


Heterogeneous Hardware Support

Heterogeneous HW support enables customers to add cheaper/faster nodes to existing systems for investment protection

Mix-n-match different servers as you grow (faster CPUs, more memory, bigger disk drives, different vendors, etc)

[Diagram: nodes with differing CPU, Memory, and Disk capacities]

Topics

Introduction to frontline data warehouses

PetaByte warehouse design with PostgreSQL

Aster contributions to PostgreSQL

Q&A


Error Logging in COPY

Separate good from malformed tuples

Per tuple error information

Errors to capture

• Type mismatch (e.g. text vs. int)

• Check constraints (e.g. int < 0)

• Malformed chars (e.g. invalid UTF-8 seq.)

• Missing / extra column

Low performance overhead

Activated on-demand using environment variables
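The behaviour described on this slide, loading the good tuples while logging the malformed ones with per-tuple error information, can be sketched outside the database. The target table shape `(id int CHECK (id >= 0), name text)` and the `load_with_error_log` helper are assumptions for illustration; this is not the COPY patch itself.

```python
def load_with_error_log(lines, delimiter="\t"):
    """Split input lines into loadable tuples and an error log.

    Hypothetical target table: (id int CHECK (id >= 0), name text).
    Each error entry is (line number, reason, raw line).
    """
    good, errors = [], []
    for lineno, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\n").split(delimiter)
        if len(fields) != 2:
            errors.append((lineno, "missing or extra column", raw))
            continue
        try:
            row_id = int(fields[0])
        except ValueError:
            errors.append((lineno, "type mismatch: id is not an int", raw))
            continue
        if row_id < 0:
            errors.append((lineno, "check constraint violated: id >= 0", raw))
            continue
        good.append((row_id, fields[1]))
    return good, errors
```

A single bad line no longer aborts the whole load: the caller gets the clean rows plus a log it can inspect or replay after fixing the input.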

Error Logging in COPY

Detailed error context is logged along with tuple content

Error Logging Performance

• 1 million tuples COPY performance


Auto-partitioning in COPY

COPY into a parent table routes tuples directly into the child table with matching constraints

COPY y2008 FROM 'data.txt'


Auto-partitioning in loading

Activated on-demand

• set tuple_routing_in_copy = 0/1;

Configurable LRU cache size

• set tuple_routing_cache_size = 3;

COPY into the parent table performs similarly to a direct COPY into the child table if the data is sorted

Will leverage partitioning information in the future (WIP for 8.5)
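Why sorted input helps can be seen in a small model of tuple routing with an LRU cache of recently matched child tables. The `TupleRouter` class, the month-based child names such as `y2008m01`, and the range-style constraints are assumptions for this sketch; only the idea of a configurable LRU cache comes from the slide.

```python
from collections import OrderedDict


class TupleRouter:
    """Route rows to child tables by month, remembering recently used children."""

    def __init__(self, children, cache_size=3):
        # children: mapping of child table name -> (first_month, last_month), inclusive.
        self.children = children
        self.cache_size = cache_size
        self.cache = OrderedDict()  # child name -> range, most recently used last
        self.full_scans = 0         # how often the cache missed

    def route(self, month):
        # Fast path: check the most recently used children first.
        for name in reversed(list(self.cache)):
            low, high = self.cache[name]
            if low <= month <= high:
                self.cache.move_to_end(name)
                return name
        # Slow path: scan every child constraint, then cache the match.
        self.full_scans += 1
        for name, (low, high) in self.children.items():
            if low <= month <= high:
                self.cache[name] = (low, high)
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)  # evict least recently used
                return name
        raise ValueError(f"no child table accepts month {month}")
```

With sorted input, consecutive rows hit the same cached child and `full_scans` grows only once per child table; with shuffled input and a small cache, most rows fall back to the slow path.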


Other contributions

Temporary tables and 2PC transactions

[Auto-]Partitioning infrastructure

Regression test suite

LFI (http://lfi.sourceforge.net/)

• Fault injection at the library level or below

• Out-of-memory conditions, network connection errors, interrupted system calls, data corruption, hardware failures, etc.

• Lightning talk on Friday!


Topics

Introduction to Aster and data warehousing

PetaByte warehouse design with PostgreSQL

Aster contributions to PostgreSQL

Q&A


PetaByte Warehouses with PostgreSQL

Unmodified PostgreSQL

“Always Parallel” MPP architecture

“Always On” on-line operations

In-Database MapReduce

PetaByte on commodity Hardware


Learn more

www.asterdata.com

Free TDWI report on advanced analytics:

• asterdata.com/mapreduce

Free Gartner webcast on mission-critical DW:

• asterdata.com/gartner

Contact us

[email protected]

Aster Data Systems


Bonus slides


nCluster Components

[Architecture diagram of nCluster components; labels illegible in this transcript]

Aster Loader

• Load Balancing
Challenge: Large data sets create a data transfer performance bottleneck.
Solution: Multiple Loaders are mapped to vWorkers on Worker nodes. After partitioning, Loaders will parallel load unique data into vWorkers.
Benefit: Linearly scalable load performance (from Loaders to Workers)

• Scalable Partitioning
Challenge: Partitioning must be scalable and not impede the loading process.
Solution: Each Loader contains a “Partitioner”, which uses an algorithm to assign data into “buckets”. Each bucket is uniquely mapped within the cluster.
Benefit: Fast, intelligent partitioning during massive-scale data loads

• Fault Handling
Challenge: Failed nodes may lose data or significantly drop load performance.
Solution: If a node fails, other streams continue to load ([...] performance hit). Admins can easily recover lost data by re-loading Bulk Feeder data.
Benefit: Data loss protection & performance consistency during node failure
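The Partitioner idea, assigning each row to a bucket and each bucket to exactly one destination, can be sketched with a stable hash. Everything specific here is an assumption: the CRC32 hash, `NUM_BUCKETS`, the `VWORKERS` names, and the modulo mapping are chosen for the example, since the slide does not name the algorithm.

```python
import zlib

NUM_BUCKETS = 8                          # hypothetical bucket count
VWORKERS = ["vw0", "vw1", "vw2", "vw3"]  # hypothetical vWorker names


def bucket_for(key):
    """Deterministically assign a distribution key to a bucket."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_BUCKETS


def vworker_for(bucket):
    """Each bucket maps to exactly one vWorker."""
    return VWORKERS[bucket % len(VWORKERS)]


def partition(rows, key_index=0):
    """Group rows by destination vWorker so each batch can be loaded in parallel."""
    batches = {vworker: [] for vworker in VWORKERS}
    for row in rows:
        batches[vworker_for(bucket_for(row[key_index]))].append(row)
    return batches
```

Because the hash is stable, every Loader running the same code sends a given key to the same vWorker without coordinating with the others, which is what lets partitioning scale with the number of Loaders.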
