Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

© 2ndQuadrant 2016
Big Data & PostgreSQL: Using TABLESAMPLE to Analyze Very Large Datasets
By Umair Shahid

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

© 2ndQuadrant 2016

Big Data & PostgreSQL

Using TABLESAMPLE to Analyze Very Large Datasets

By Umair Shahid

Page 2: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Who am I?

- Got “pushed” into PostgreSQL in 2004, ended up falling in love with it
- Not a hardcore techie, yet passionate about open source software
- Heading the productization efforts at 2ndQuadrant
- Interested in Big Data, specifically the newer PostgreSQL features supporting it

Page 3: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

What is the problem?

Number of Rows    Size on Disk (MB)    Time Taken (ms)
1k                             0.23            219.706
100k                             24          1,302.135
1M                              195          7,696.386
5M                              951         40,691.603
10M                           1,923         60,012.457
100M                         19,456        801,493.319

Page 4: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Why is this significant?

- Data mining has typically been a painful process
- Major contributor to the pain has been the time it takes for queries to return
- Many false steps before the required data is identified
- Waiting time is wasted time
- Sampling, count based or time based, reduces the wasted time significantly

Page 5: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

What is TABLESAMPLE?

Ability to read a random sample of data in a table

Defined in SQL:2003 (5th revision of SQL)

Implemented in PostgreSQL 9.5

Page 6: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Syntax

SELECT select_expression
FROM table_name
TABLESAMPLE sampling_method ( argument [, ...] )
[ REPEATABLE ( seed ) ]
...
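As a minimal sketch of this syntax, assuming a hypothetical table named `orders` with a numeric `amount` column (neither appears in the slides), a sampled query might look like this:

```sql
-- Hypothetical table: orders(id, amount, created_at). Not from the original slides.
-- Read roughly 1% of the table's rows with the built-in BERNOULLI method
-- and compute an approximate average over that sample.
SELECT avg(amount) AS approx_avg_amount
FROM orders TABLESAMPLE BERNOULLI (1);

-- A WHERE clause filters the sampled rows, not the full table:
-- sampling happens first, then the condition is applied.
SELECT avg(amount) AS approx_avg_recent
FROM orders TABLESAMPLE BERNOULLI (1)
WHERE created_at >= DATE '2016-01-01';
```

Because the filter runs after sampling, a very selective condition combined with a small percentage may return few or no rows.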

Page 7: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

sampling_method

argument is percentage of rows

- SYSTEM
  - Block level sampling
  - Very fast
  - Non-independent rows
- BERNOULLI
  - Row level sampling
  - Slower than SYSTEM
  - Independent rows (uniformly random)
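The difference between the two built-in methods can be sketched as follows. The `orders` table is a hypothetical stand-in, and the 10% figure is chosen only for illustration:

```sql
-- Hypothetical table: orders. Not from the original slides.

-- SYSTEM: each block (page) of the table has a 10% chance of being chosen,
-- and every row in a chosen block is returned. Only the chosen blocks are
-- read, which is why it is fast, but rows stored together arrive together.
SELECT count(*) AS sampled_rows
FROM orders TABLESAMPLE SYSTEM (10);

-- BERNOULLI: the whole table is scanned and each row is kept independently
-- with 10% probability. Slower, but the rows are uniformly random.
SELECT count(*) AS sampled_rows
FROM orders TABLESAMPLE BERNOULLI (10);

-- A rough estimate of the total row count: scale the sampled count
-- by the inverse of the sampling fraction (10% -> multiply by 10).
SELECT count(*) * 10 AS estimated_total_rows
FROM orders TABLESAMPLE BERNOULLI (10);
```

In both cases the percentage is a target, not a guarantee, so the number of rows returned varies from run to run. With SYSTEM the variation tends to be larger on small tables, since whole blocks are either included or skipped.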

Page 8: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Page 9: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Demo sampling methods

Page 10: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

REPEATABLE results

(Reminder: [ REPEATABLE ( seed ) ])

- Optional argument
- Used if random, yet repeatable results are required
- seed and argument need to be the same to produce repeatable results
- Any changes made to the table will result in a different data set
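A short sketch of how REPEATABLE behaves, again assuming a hypothetical `orders` table with an `id` column; the seed values 42 and 43 are arbitrary:

```sql
-- Hypothetical table: orders(id, ...). Not from the original slides.

-- Same method, same argument (5) and same seed (42):
-- both queries return the same sample as long as the table is unchanged.
SELECT id FROM orders TABLESAMPLE BERNOULLI (5) REPEATABLE (42);
SELECT id FROM orders TABLESAMPLE BERNOULLI (5) REPEATABLE (42);

-- A different seed will usually produce a different sample.
SELECT id FROM orders TABLESAMPLE BERNOULLI (5) REPEATABLE (43);

-- Changing the argument (5 -> 6) also changes the sample,
-- even though the seed is the same.
SELECT id FROM orders TABLESAMPLE BERNOULLI (6) REPEATABLE (42);
```

This is useful when an analysis needs to be rerun or shared and the same rows should be examined each time.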

Page 11: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Now it gets interesting …

- TABLESAMPLE allows for additional sampling methods via extensions
- tsm_system_time specifies max number of milliseconds to spend reading a table
- Implements the syntax:

SELECT select_expression
FROM table_name
TABLESAMPLE SYSTEM_TIME (argument)
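A minimal sketch of time-based sampling, assuming the `tsm_system_time` contrib module is installed on the server and using a hypothetical `orders` table. The second half shows `tsm_system_rows`, a sibling contrib module that the slides do not name; it is included here only as one way to get the count-based sampling mentioned earlier in the deck:

```sql
-- Hypothetical table: orders(id, amount). Not from the original slides.

-- Time-based sampling: the extension must be created once per database.
CREATE EXTENSION IF NOT EXISTS tsm_system_time;

-- Spend at most about 1000 ms reading the table, then aggregate
-- over whatever rows were read in that time.
SELECT avg(amount) AS approx_avg_amount
FROM orders TABLESAMPLE SYSTEM_TIME (1000);

-- Count-based sampling (assumption: not named in the slides).
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- Read up to 100 rows from randomly chosen blocks.
SELECT *
FROM orders TABLESAMPLE SYSTEM_ROWS (100);
```

Both methods sample at the block level, like SYSTEM, so the rows they return are not fully independent of each other.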

Page 12: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Demo tsm_system_time

Page 13: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Enter Orange ...

- Funded by AXLE (http://axleproject.eu)
- Same project funded TABLESAMPLE
- Available integrated with PostgreSQL in 2UDA (http://2ndquadrant.com/2uda)
- Uses TABLESAMPLE to very quickly create visualizations for data
- Can quickly create predictive models

Page 14: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Demo Orange

You can find a very helpful tutorial at http://2ndquadrant.com/2uda

Page 15: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Other Big Data features in PostgreSQL

- JSON & JSONB
- HSTORE
- XML
- Scale-out by partitioning
- Check out Postgres-XL (http://www.postgres-xl.org/)
- etc ...

Page 16: Big Data & PostgreSQL - Using TABLESAMPLE to Analyze Very Large Datasets

Umair Shahid

Email: [email protected]

Twitter: @pg_umair

2ndQuadrant is hiring - All geographies!

Thank you for your time!