Source file: vizatra.pdf. Title: presenting vizatra.key. Created Date: 8/12/2015 8:21:56 PM

Copyright © 2014 Criteo

Vizatra: the RDBMS Friendly Vizualization Tool

Justin Coffey, Sr Staff Devlead Analytics Infrastructure, Criteo - @jqcoffey, [email protected]

What we’ll be talking about

Why we built yet another dashboarding tool!

We’ll be covering:

• What makes Vizatra play nice with RDBMSes

• A real-world use case as rationale

• A demo of Vizatra in action

• Why Finatra

On with it, then. Who is Criteo?

Criteo is the biggest tech company you’ve never heard of.

Criteo: what we do

Criteo is a Performance Advertising Company

We drive conversions for our advertisers across multiple channels:

• Banner Ads

• Mobile

• Email

• Search

We pay for displays, charge for traffic to our advertisers and optimize for conversions

With a 90% retention rate, our clients love us!

Criteo: data scale

• 3 Billion ads served per day

• 50 Billion events logged per day

• 25 TB generated per day

Scale of the Analytics Stack at Criteo

25+ TB ingested / day

1000+ jobs / day

5+ PB under management

100+ Analysts

200+ Engineers

1000+ Sales and Ops

Criteo: Scale of the Rooftop Deck

A fairly simple stack

Hadoop for Primary Storage and MapReduce

Cascading, Scalding and Hive for Data Transformation

Hive and Vertica for Data Warehousing

Tableau and ROLAP Cube for Structured Data Access

The stack

But wait. You said something about Vizatra?

Vizatra for fast operational reporting

Hadoop for Primary Storage and MapReduce

Cascading, Scalding and Hive for Data Transformation

Hive and Vertica for Data Warehousing

Tableau and ROLAP Cube for Structured Data Access

Vizatra for speed

Vizatra

Today we’ll be talking about Vizatra

Why Vizatra?

There are lots of dashboarding tools out there. Why invent another one?

In short, they all use query builders and query builders are naïve…

Why Vizatra: Query Building in Tableau

This simple build expression:

Results in this seemingly reasonable query:

Why Vizatra: Query Time for Tableau Query Builder

7 seconds*

*running on a small 7-node Vertica cluster

Why Vizatra: My Ugly but Optimized Query

Why Vizatra: Query Time for My Ugly Query

1 second*

*running on a small 7-node Vertica cluster

Why Vizatra: Okay, so wait a minute…

You’re telling me that a perfectly well-written query is 7 times slower than that awful query you just showed?

Let me EXPLAIN

Why Vizatra: A Tale of 2 Query Plans

Naïve query materializes 170M joined tuples:

| +---> JOIN HASH [Cost: 118K, Rows: 170M (8K RLE)] (PATH ID: 2)
| | Join Cond: (fact_euro_rates_hourly.time_id = fact_network_stats.time_id)
| | Materialize at Output: fact_network_stats.revenue_euro, fact_network_stats.tac_euro
| | Execute on: All Nodes

Optimized query requires only 8K:

| +---> JOIN HASH [Cost: 501K, Rows: 8K] (PATH ID: 2)
| | Join Cond: (f.time_id = fact_euro_rates_hourly.time_id)
| | Materialize at Output: fact_euro_rates_hourly.rate
| | Execute on: All Nodes
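
The two queries themselves are not preserved in this transcript, so the following Python sketch is only one plausible reading of the plans: it assumes the optimized query aggregates the fact table per time_id before joining to the hourly rates, so the join sees a few rows instead of every fact row. The data, the function names and the integer rates are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (time_id, revenue) fact rows and one rate per time_id.
facts = [(1, 10), (1, 20), (2, 5), (2, 15), (2, 30)]
rates = {1: 2, 2: 3}

def naive(facts, rates):
    """Join every fact row to its rate, then aggregate."""
    joined = [(t, revenue * rates[t]) for t, revenue in facts]
    totals = defaultdict(int)
    for t, value in joined:
        totals[t] += value
    return dict(totals), len(joined)

def optimized(facts, rates):
    """Aggregate facts per time_id first, then join the small result."""
    per_time = defaultdict(int)
    for t, revenue in facts:
        per_time[t] += revenue
    joined = [(t, total * rates[t]) for t, total in per_time.items()]
    return dict(joined), len(joined)

naive_totals, naive_joined = naive(facts, rates)
fast_totals, fast_joined = optimized(facts, rates)
```

Both functions return the same totals, but the second one joins one row per time_id rather than one row per fact, which mirrors the 170M versus 8K difference in the plans above.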

Why Vizatra: When Clients Demand QPS-based SLAs

You have to build optimized views of your data, and to build them well we need to know the queries up front and, occasionally, modify them.

Why Vizatra: We Can’t Always Change the Data Model

In production scenarios, large RDBMSes have fairly static schemas.

Why Vizatra: Analysts Love SQL

The best DSL ever written to extract data has already been written—it’s SQL.

So let them write it.

Why Vizatra: Dynamic Queries without Query Building

How can you model OLAP query workloads without a query builder?

Why Vizatra: Hand Coding OLAP Queries

Given four dimensions, time, advertiser, publisher and device, how many queries must we write to mimic an OLAP cube?

2^4 = 16, and that’s without any predicates
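
The count comes from each dimension being either grouped on or rolled up. A minimal Python sketch that enumerates the combinations (the helper name is illustrative, not part of Vizatra):

```python
from itertools import combinations

DIMENSIONS = ["time_id", "advertiser", "publisher", "device"]

def group_by_sets(dimensions):
    """Return every subset of dimensions, from the grand total to the full grain."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets

all_sets = group_by_sets(DIMENSIONS)
print(len(all_sets))  # 16
```

Each tuple in all_sets corresponds to one hand-written group by clause; adding optional predicates per dimension multiplies the count further.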

Why Vizatra: The Light Bulb Moment

What if we had our Super Analysts write the optimal lower bound query and then deconstructed that for rollups?

That way, we know the worst case query time and can infer that all derivative queries will be at least as fast!

Presenting VizSQL

VizSQL is the query language of Vizatra and is at the very core of how it works with RDBMSes.

Presenting VizSQL: A Simple Query

select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
group by
  time_id,
  advertiser,
  publisher,
  device

How can we make this simple SQL query OLAP compatible?

Presenting VizSQL: A Simple Query

First, let’s add a where clause:

select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between ? and ?
  and advertiser = ?
  and publisher = ?
  and device = ?
group by
  time_id,
  advertiser,
  publisher,
  device

Presenting VizSQL: A Simple Query

And make it easy to parameterize:

select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between {time_id:0} and {time_id:1}
  and advertiser = {advertiser}
  and publisher = {publisher}
  and device = {device}
group by
  time_id,
  advertiser,
  publisher,
  device
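
How Vizatra resolves these placeholders is not described in the slides. As a hedged illustration, the Python sketch below assumes that {name} refers to a single value and {name:index} to one element of a multi-valued parameter, and turns both into positional ? markers plus an ordered parameter list. The bind function and its behaviour are assumptions, not Vizatra's API.

```python
import re

# Matches {name} and {name:index}; the index part is optional.
PLACEHOLDER = re.compile(r"\{(\w+)(?::(\d+))?\}")

def bind(template, values):
    """Replace {name} / {name:index} with '?' and collect parameters in order."""
    params = []

    def substitute(match):
        name, index = match.group(1), match.group(2)
        value = values[name]
        params.append(value[int(index)] if index is not None else value)
        return "?"

    return PLACEHOLDER.sub(substitute, template), params

sql, params = bind(
    "time_id between {time_id:0} and {time_id:1} and advertiser = {advertiser}",
    {"time_id": ("2015-07-01", "2015-08-01"), "advertiser": 42},
)
```

Named placeholders keep the query readable for analysts while still letting the values travel separately from the SQL text, as with an ordinary prepared statement.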

Presenting VizSQL: A Simple Query

select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between {time_id:0} and {time_id:1}
  and advertiser = {advertiser}
  and publisher = {publisher}
  and device = {device}
group by
  time_id,
  advertiser,
  publisher,
  device

Next, let’s annotate it using SQL comments:

Presenting VizSQL: A Simple Query

select
  /* dim: time_id */ time_id,
  /* dim: advertiser */ advertiser,
  /* dim: publisher */ publisher,
  /* dim: device */ device,
  /* metric: cost */ sum(cost) as cost,
  /* metric: revenue */ sum(revenue) as revenue
from
  facts
where
  /* parameter: time_id */ time_id between {time_id:0} and {time_id:1}
  /* filter: advertiser */ and advertiser = {advertiser}
  /* filter: publisher */ and publisher = {publisher}
  /* filter: device */ and device = {device}
group by
  /* dim: time_id */ time_id,
  /* dim: advertiser */ advertiser,
  /* dim: publisher */ publisher,
  /* dim: device */ device

Presenting VizSQL: A Simple Query

Finally, query the full set of advertiser dimensions for a given time range:

select
  /* dim: time_id */ time_id,
  /* dim: advertiser */ advertiser,
  /* dim: publisher */ publisher,
  /* dim: device */ device,
  /* metric: cost */ sum(cost) as cost,
  /* metric: revenue */ sum(revenue) as revenue
from
  facts
where
  /* parameter: time_id */ time_id between '2015-07-01' and '2015-08-01'
  /* filter: advertiser */ and advertiser = {advertiser}
  /* filter: publisher */ and publisher = {publisher}
  /* filter: device */ and device = {device}
group by
  /* dim: time_id */ time_id,
  /* dim: advertiser */ advertiser,
  /* dim: publisher */ publisher,
  /* dim: device */ device

Presenting VizSQL: Reducing Query Complexity

Because the query gets less complex as we use it, we infer its run time will not increase in practice.
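
To make the idea concrete, here is a rough Python sketch of deriving a rollup from the annotated parts of the query. It works on a hand-built, simplified model of the annotations rather than parsing SQL, and the rollup function and its arguments are hypothetical, not Vizatra's actual interface.

```python
# Hypothetical, simplified model of the annotated query; Vizatra parses real SQL.
DIMS = ["time_id", "advertiser", "publisher", "device"]
METRICS = {"cost": "sum(cost)", "revenue": "sum(revenue)"}
PARAMETER = "time_id between {time_id:0} and {time_id:1}"  # always kept
FILTERS = {
    "advertiser": "advertiser = {advertiser}",
    "publisher": "publisher = {publisher}",
    "device": "device = {device}",
}

def rollup(dims, metrics, filters=()):
    """Build a derivative query keeping only the requested annotated parts."""
    kept_dims = [d for d in DIMS if d in dims]  # preserve the original column order
    select = kept_dims + [f"{METRICS[m]} as {m}" for m in METRICS if m in metrics]
    where = [PARAMETER] + [FILTERS[f] for f in FILTERS if f in filters]
    sql = "select " + ", ".join(select) + " from facts where " + " and ".join(where)
    if kept_dims:
        sql += " group by " + ", ".join(kept_dims)
    return sql

query = rollup(dims={"advertiser", "time_id"}, metrics={"revenue"})
```

Every derivative query drops columns, metrics or predicates from the analyst's original, and never adds any, which is the basis for the run-time inference stated above.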

Presenting VizSQL: Feature Summary

• Rich SQL Support (targeting SQL99)

• Subquery complexity reduction

• Join culling

• Vizatra-side hash joins for denormalized tables

• Result set caching

A Quick Study with Vertica

Vertica is widely regarded as one of the best analytic RDBMSes on the market.

A quick review of how Vizatra dovetails with its feature set.

Vertica: Highlighting its Strengths

Vertica has a very rich and performant analytic SQL dialect.

VizSQL is able to deconstruct every production query we have.**

I don’t know* of a query builder that supports all of it.

*but then again, I’m not all-knowing.

**and we promise to (try to) support every single query that you can write.

Vertica: Highlighting its Strengths

Vertica uses projections (aka materialized views) to ensure optimal query execution, but efficient operations depend on their column order.

VizSQL deconstruction maintains column order and thus has a higher likelihood of triggering merge joins and pipelined grouping.

Vertica: Highlighting its Strengths

Vertica query performance on completely denormalized, numeric-only tables (i.e. no joins, no strings) is truly impressive.

Vizatra supports local hash joins between facts and small dimensional tables, obviating the need for star schemas in many cases.
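
A local hash join of this kind can be sketched in a few lines of Python. This is a generic illustration of the technique with made-up rows and names, not Vizatra's code: the database returns numeric ids only, and the application attaches the string attributes afterwards.

```python
# Hypothetical rows: facts carry only numeric ids; names live in a small dimension table.
fact_rows = [
    {"campaign_id": 1, "displays": 100, "clicks": 7},
    {"campaign_id": 2, "displays": 50, "clicks": 1},
    {"campaign_id": 1, "displays": 25, "clicks": 2},
]
dim_campaign = [
    {"campaign_id": 1, "client_name": "acme"},
    {"campaign_id": 2, "client_name": "globex"},
]

def local_hash_join(facts, dimension, key):
    """Build a hash table on the small side, then probe it once per fact row."""
    lookup = {row[key]: row for row in dimension}
    joined = []
    for fact in facts:
        match = lookup.get(fact[key])
        if match is not None:  # inner join: drop facts with no dimension row
            joined.append({**match, **fact})
    return joined

result = local_hash_join(fact_rows, dim_campaign, "campaign_id")
```

Because the dimension table is small, the lookup fits comfortably in memory and the join costs one dictionary probe per result row.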

Vertica: has no cache

Vertica has no query cache and can’t pre-cache commonly queried data.

Vizatra currently caches result sets in a shared on-heap LRU cache. Distributed caching is under study.
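
The slides name Guava Cache as the implementation. The Python sketch below only illustrates the general shape of an LRU result set cache, assuming entries are keyed by the SQL text and its bound parameters; the class and the fake database function are hypothetical.

```python
from collections import OrderedDict

class ResultSetCache:
    """A tiny LRU cache keyed by (sql, params)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get_or_load(self, sql, params, load):
        key = (sql, tuple(params))
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        rows = load(sql, params)  # cache miss: hit the database
        self._entries[key] = rows
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return rows

calls = []

def fake_database(sql, params):
    """Stand-in for a real query; records each call so hits are visible."""
    calls.append((sql, tuple(params)))
    return [("row for", params[0])]

cache = ResultSetCache(max_entries=2)
cache.get_or_load("select 1", ["a"], fake_database)
cache.get_or_load("select 1", ["a"], fake_database)  # served from the cache
cache.get_or_load("select 1", ["b"], fake_database)
cache.get_or_load("select 1", ["c"], fake_database)  # evicts the "a" entry
```

Only three of the four lookups reach the stand-in database; the repeated query is answered from memory.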

Vertica Real World Query Optimizing

Some basic optimization efforts.

Vertica Performance: Basic Optimization

By default Vertica projections are segmented by a hash of all columns and unsorted, meaning that for pretty much any operation (join, filter, group by) queries will be sub-optimal.

Vertica Performance: Optimizing for MERGEJOIN

If we sort both tables by the join key, we can take advantage of streaming joins, and Vertica is very fast at streaming :)

Query against default projection:

select
  cast(date_trunc('day', f.time_id) as DATE) as day,
  c.client_name,
  sum(f.displays),
  sum(f.clicks)
from
  fact_default_projection f
  join dim_campaign c on c.campaign_id = f.campaign_id
group by
  1, 2

Run time: ~3.5s

Query against join key ordered projection:

select
  cast(date_trunc('day', f.time_id) as DATE) as day,
  c.client_name,
  sum(f.displays),
  sum(f.clicks)
from
  facts_ordered_by_campaign_id f
  join dim_campaign c on c.campaign_id = f.campaign_id
group by
  1, 2

Run time: ~1.5s

Merge Join is More than Twice as fast!
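
For readers unfamiliar with merge joins, the Python sketch below shows why sorted inputs allow a single streaming pass. It is a textbook illustration with hypothetical data, not a description of Vertica's internals, and it assumes the join key is unique on the dimension side.

```python
def merge_join(left, right, key):
    """Inner-join two lists of dicts that are both sorted by key.

    Assumes key is unique on the right side (a dimension table).
    """
    joined = []
    j = 0
    for row in left:
        # Advance the right cursor; it never moves backwards.
        while j < len(right) and right[j][key] < row[key]:
            j += 1
        if j < len(right) and right[j][key] == row[key]:
            joined.append({**right[j], **row})
    return joined

facts = [
    {"campaign_id": 1, "clicks": 7},
    {"campaign_id": 1, "clicks": 2},
    {"campaign_id": 3, "clicks": 5},
]
campaigns = [
    {"campaign_id": 1, "client_name": "acme"},
    {"campaign_id": 2, "client_name": "globex"},
    {"campaign_id": 3, "client_name": "initech"},
]
joined = merge_join(facts, campaigns, "campaign_id")
```

Neither input is buffered or hashed: each side is read once, in order, which is what makes the join cheap when both projections are sorted on the join key.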

Vertica Performance: Denormalize and PIPELINE it all!

If we can get rid of all joins, and strings, and get group by pipeline operations, Vertica is wicked fast.

Let’s query some numerics without joins:

select
  campaign_id,
  sum(displays),
  sum(clicks)
from
  fact_order_by_campaign_time
group by
  1

Run time: ~150ms!

Let’s add back the date:

select
  campaign_id,
  cast(date_trunc('day', time_id) as DATE) as day,
  sum(displays),
  sum(clicks)
from
  fact_order_by_campaign_time
group by
  1, 2

Run time: ~1000ms?!?

date_trunc breaks group by pipeline operations!
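
A pipelined group by relies on rows arriving already ordered by the grouping keys, so each group can be emitted as soon as the key changes. The Python sketch below is a generic illustration with hypothetical rows, not Vertica's implementation. A plausible reading of the slowdown above is that the planner cannot rely on the stored sort order once a grouping key is wrapped in an expression, so it falls back to hashing.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already sorted by campaign_id as the projection would store them.
rows = [
    {"campaign_id": 1, "displays": 100, "clicks": 7},
    {"campaign_id": 1, "displays": 25, "clicks": 2},
    {"campaign_id": 2, "displays": 50, "clicks": 1},
]

def pipelined_group_by(sorted_rows, key):
    """Stream over sorted rows, emitting each group as soon as its key changes."""
    for value, group in groupby(sorted_rows, key=itemgetter(key)):
        displays = clicks = 0
        for row in group:
            displays += row["displays"]
            clicks += row["clicks"]
        yield value, displays, clicks

totals = list(pipelined_group_by(rows, "campaign_id"))
```

Only one group's running sums are held in memory at a time, whereas a hash-based group by must keep every group until the input is exhausted.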

Vertica Performance: Materialize deterministic non-deterministic functions

Create a projection with calculations:

create projection fact_order_by_day_campaign (
  campaign_id,
  displays,
  clicks,
  time_day
) AS SELECT
  campaign_id,
  displays,
  clicks,
  cast(date_trunc('day', time_id) as DATE) as time_day
FROM
  fact
ORDER BY
  cast(date_trunc('day', time_id) as DATE),
  campaign_id
segmented by HASH(campaign_id, displays, clicks) ALL NODES KSAFE 1

Let’s re-run that query:

select
  time_day,
  campaign_id,
  sum(displays),
  sum(clicks)
from
  fact_order_by_day_campaign
group by
  1, 2

Run time: ~150ms! Yes!

Vizatra Design

A quick architectural review

Vizatra Design: Finatra Backbone

• Full Angular Front End

• Pure D3 Charting

• Limited JS dependencies

• Finatra for RESTfulish endpoints

• Scala parser combinators for VizSQL parsing

• HOCON config

• Deploys as self-executable JAR

• Metrics sent to Graphite

• First Criteo app in Mesos!

[Architecture diagram; labels: Render: D3 + Angular, meta-data request, DataSet request, Finatra, AppController, DataSetService, Guava Cache, VizSQL Parser, VizSQLDataSource, RDBMS]

Vizatra Design: Why Finatra?

• Simple and Lightweight

• Doesn’t impose itself on the developer

• Finagle Ecosystem

• Pretty complete and concise documentation

A Vizatra Demo

Prepare for the demo effect.

Thanks to Sean Lahman for the dataset: http://www.seanlahman.com/baseball-archive/statistics/

The Vizatra Roadmap

• More time series charts (dual axis, stacked area, etc.)

• Drill down in navigation matrix

• Easier board configuration

• Compile-time query validation

• Non-time series analytics

• Multiple visualizations in the same view

• Composable views

• Saveable views

• Best-effort query analysis (i.e. no need for annotations)

And One More Thing

We will be open sourcing Vizatra in Q3/Q4.

Thank You

Questions?