Copyright © 2014 Criteo
Vizatra: the RDBMS Friendly Vizualization Tool
Justin Coffey, Sr Staff Devlead, Analytics Infrastructure, Criteo - @jqcoffey
What we’ll be talking about
Why we built yet another dashboarding tool!
We’ll be covering:
• What makes Vizatra play nice with RDBMSes
• A real-world use case as rationale
• A demo of Vizatra in action
• Why Finatra
On with it, then. Who is Criteo?
Criteo is the biggest tech company you’ve never heard of.
Criteo: what we do
Criteo is a Performance Advertising Company
We drive conversions for our advertisers across multiple channels:
• Banner Ads
• Mobile
• Email
• Search
We pay for displays, charge for traffic to our advertisers and optimize for conversions
With a 90% retention rate, our clients love us!
Criteo: data scale
• 3 Billion ads served per day
• 50 Billion events logged per day
• 25 TB generated per day
Scale of the Analytics Stack at Criteo
25+ TB ingested / day
1000+ jobs / day
5+ PB under management
100+ Analysts
200+ Engineers
1000+ Sales and Ops
Criteo: Scale of the Rooftop Deck
A fairly simple stack
Hadoop for Primary Storage and MapReduce
Cascading, Scalding and Hive for Data Transformation
Hive and Vertica for Data Warehousing
Tableau and ROLAP Cube for Structured Data Access
The stack
But wait. You said something about Vizatra?
Vizatra for fast operational reporting
Hadoop for Primary Storage and MapReduce
Cascading, Scalding and Hive for Data Transformation
Hive and Vertica for Data Warehousing
Tableau and ROLAP Cube for Structured Data Access.
Vizatra for speed
Vizatra
Today we’ll be talking about Vizatra
Why Vizatra?
There are lots of dashboarding tools out there. Why invent another one?
In short, they all use query builders and query builders are naïve…
Why Vizatra: Query Building in Tableau
This simple build expression:
Results in this seemingly reasonable query:
Why Vizatra: Query Time for Tableau Query Builder
7 seconds*
*running on a small 7-node Vertica cluster
Why Vizatra: My Ugly but Optimized Query
Why Vizatra: Query Time for My Ugly Query
1 second*
*running on a small 7-node Vertica cluster
Why Vizatra: Okay, so wait a minute…
You’re telling me that a perfectly well written query is 7 times slower than that awful query you just showed?
Let me EXPLAIN
Why Vizatra: A Tale of 2 Query Plans
Naïve query materializes 170M joined tuples:
| +---> JOIN HASH [Cost: 118K, Rows: 170M (8K RLE)] (PATH ID: 2)
| | Join Cond: (fact_euro_rates_hourly.time_id = fact_network_stats.time_id)
| | Materialize at Output: fact_network_stats.revenue_euro, fact_network_stats.tac_euro
| | Execute on: All Nodes
Optimized query requires only 8K:
| +---> JOIN HASH [Cost: 501K, Rows: 8K] (PATH ID: 2)
| | Join Cond: (f.time_id = fact_euro_rates_hourly.time_id)
| | Materialize at Output: fact_euro_rates_hourly.rate
| | Execute on: All Nodes
Why Vizatra: When Clients Demand QPS based SLAs
You have to build optimized views of your data,
and to better build them we need to know the queries upfront,
and occasionally, modify them.
Why Vizatra: We Can’t Always Change the Data Model
In production scenarios, large RDBMSes have fairly static schemas.
Why Vizatra: Analysts Love SQL
The best DSL ever written to extract data has already been written—it’s SQL.
So let them write it.
Why Vizatra: Dynamic Queries without Query Building
How can you model OLAP query workloads without a query builder?
Why Vizatra: Hand Coding OLAP Queries
Given four dimensions (time, advertiser, publisher and device), how many queries must we write to mimic an OLAP cube?
4², and that’s without any predicates
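One way to see where the count comes from: each dimension is either kept in the group by or rolled up, which gives 2^4 = 16 groupings for four dimensions. A minimal Python sketch that enumerates them, using only the dimension names from the slide (the helper name `all_groupings` is made up for illustration):

```python
from itertools import combinations

DIMENSIONS = ["time_id", "advertiser", "publisher", "device"]

def all_groupings(dimensions):
    """Return every subset of dimensions, from the grand total to full detail."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

groupings = all_groupings(DIMENSIONS)
print(len(groupings))  # 16
for g in groupings:
    print("group by " + ", ".join(g) if g else "(no group by: grand total)")
```

Adding optional predicates on top of each grouping multiplies the count again, which is why hand-writing every variant does not scale.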
Why Vizatra: The Light Bulb Moment
What if we had our Super Analysts write the optimal lower bound query and then de-constructed that for rollups?
That way, we know the worst case query time and can infer that all derivative queries will be at least as fast!
Presenting VizSQL
VizSQL is the query language of Vizatra and is at the very core of how it works with RDBMSes.
Presenting VizSQL: A Simple Query
select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
group by
  time_id,
  advertiser,
  publisher,
  device
How can we make this simple SQL query OLAP compatible?
Presenting VizSQL: A Simple Query
First, let’s add a where clause:
select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between ? and ?
  and advertiser = ?
  and publisher = ?
  and device = ?
group by
  time_id,
  advertiser,
  publisher,
  device
Presenting VizSQL: A Simple Query
And make it easy to parameterize:
select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between {time_id:0} and {time_id:1}
  and advertiser = {advertiser}
  and publisher = {publisher}
  and device = {device}
group by
  time_id,
  advertiser,
  publisher,
  device
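The talk does not say how Vizatra binds these named placeholders. As a rough sketch only, one plausible approach is to turn `{name}` and `{name:index}` back into positional `?` markers and collect the values in order, so the database driver still does the escaping. The function `bind` and the shape of `params` are assumptions made for this example:

```python
import re

# Matches {name} and {name:index}, the two placeholder shapes shown on the slide.
PLACEHOLDER = re.compile(r"\{(\w+)(?::(\d+))?\}")

def bind(sql, params):
    """Replace each placeholder with '?' and collect the bound values in order.

    `params` maps a name to a single value, or to a list when the
    placeholder carries an index such as {time_id:0}.
    """
    args = []

    def substitute(match):
        name, index = match.group(1), match.group(2)
        value = params[name]
        args.append(value[int(index)] if index is not None else value)
        return "?"

    return PLACEHOLDER.sub(substitute, sql), args

sql = "where time_id between {time_id:0} and {time_id:1} and device = {device}"
bound_sql, bound_args = bind(
    sql, {"time_id": ["2015-07-01", "2015-08-01"], "device": 3}
)
print(bound_sql)   # where time_id between ? and ? and device = ?
print(bound_args)  # ['2015-07-01', '2015-08-01', 3]
```

Named placeholders keep the analyst's SQL readable, while the positional form is what a typical prepared statement expects.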
Presenting VizSQL: A Simple Query
select
  time_id,
  advertiser,
  publisher,
  device,
  sum(cost) as cost,
  sum(revenue) as revenue
from
  facts
where
  time_id between {time_id:0} and {time_id:1}
  and advertiser = {advertiser}
  and publisher = {publisher}
  and device = {device}
group by
  time_id,
  advertiser,
  publisher,
  device
Next, let’s annotate it using SQL comments:
Presenting VizSQL: A Simple Query
select
  /* dim: time_id */
  time_id,
  /* dim: advertiser */
  advertiser,
  /* dim: publisher */
  publisher,
  /* dim: device */
  device,
  /* metric: cost */
  sum(cost) as cost,
  /* metric: revenue */
  sum(revenue) as revenue
from
  facts
where
  /* parameter: time_id */
  time_id between {time_id:0} and {time_id:1}
  /* filter: advertiser */
  and advertiser = {advertiser}
  /* filter: publisher */
  and publisher = {publisher}
  /* filter: device */
  and device = {device}
group by
  /* dim: time_id */
  time_id,
  /* dim: advertiser */
  advertiser,
  /* dim: publisher */
  publisher,
  /* dim: device */
  device
Presenting VizSQL: A Simple Query
Finally, query the full set of advertisers dimensions for a given time range:
select
  /* dim: time_id */
  time_id,
  /* dim: advertiser */
  advertiser,
  /* dim: publisher */
  publisher,
  /* dim: device */
  device,
  /* metric: cost */
  sum(cost) as cost,
  /* metric: revenue */
  sum(revenue) as revenue
from
  facts
where
  /* parameter: time_id */
  time_id between '2015-07-01' and '2015-08-01'
  /* filter: advertiser */
  and advertiser = {advertiser}
  /* filter: publisher */
  and publisher = {publisher}
  /* filter: device */
  and device = {device}
group by
  /* dim: time_id */
  time_id,
  /* dim: advertiser */
  advertiser,
  /* dim: publisher */
  publisher,
  /* dim: device */
  device
Presenting VizSQL: Reducing Query Complexity
Because the query gets less complex as we use it, we infer its run time will not increase in practice.
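To make the idea concrete, here is a small Python sketch of what deconstruction could look like. It is not Vizatra's implementation: it skips SQL parsing entirely and starts from a hand-written, pre-parsed form of the annotated query. It also assumes that a `parameter` is always kept while a `filter` is dropped when no value is supplied, which is a reading of the slides rather than something the talk states. All names (`DIMS`, `FILTERS`, `deconstruct`) are hypothetical.

```python
# Hypothetical pre-parsed form of the annotated query. A real VizSQL parser
# would build something like this from the /* dim: ... */ style comments.
DIMS = {
    "time_id": "time_id",
    "advertiser": "advertiser",
    "publisher": "publisher",
    "device": "device",
}
METRICS = {"cost": "sum(cost)", "revenue": "sum(revenue)"}
PARAMETERS = ["time_id between {time_id:0} and {time_id:1}"]  # assumed mandatory
FILTERS = {
    "advertiser": "advertiser = {advertiser}",
    "publisher": "publisher = {publisher}",
    "device": "device = {device}",
}

def deconstruct(dims, metrics, bound_filters):
    """Build a reduced query that keeps only the requested pieces.

    Iterating over the original dicts preserves the column order of the
    analyst's query.
    """
    kept_dims = [expr for name, expr in DIMS.items() if name in dims]
    kept_metrics = [
        f"{expr} as {name}" for name, expr in METRICS.items() if name in metrics
    ]
    kept_filters = [expr for name, expr in FILTERS.items() if name in bound_filters]

    sql = "select " + ", ".join(kept_dims + kept_metrics)
    sql += " from facts"
    sql += " where " + " and ".join(PARAMETERS + kept_filters)
    if kept_dims:
        sql += " group by " + ", ".join(kept_dims)
    return sql

rollup = deconstruct(
    dims={"advertiser", "time_id"}, metrics={"revenue"}, bound_filters={"device"}
)
print(rollup)
```

Every derived query selects and groups by a subset of the original columns, which is the basis for expecting it to run no slower than the full query.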
Presenting VizSQL: Feature Summary
• Rich SQL Support (targeting SQL99)
• Subquery complexity reduction
• Join culling
• Vizatra-side hash joins for denormalized tables
• Result set caching
A Quick Study with Vertica
Vertica is widely regarded as one of the best analytic RDBMSes on the market.
A quick review of how Vizatra dovetails with its feature set.
Vertica: Highlighting its Strengths
Vertica has a very rich and performant analytic SQL dialect
VizSQL is able to deconstruct every production query we have.**
I don’t know* of a query builder that supports all of it.
*but then again, I’m not all knowing.
**and we promise to (try to) support every single query that you can write.
Vertica: Highlighting its Strengths
Vertica uses projections (aka materialized views) to ensure optimal query execution, but they are column order dependent for efficient operations.
VizSQL deconstruction maintains column order and thus has a higher likelihood of triggering merge joins and pipelined grouping.
Vertica: Highlighting its Strengths
Vertica query performance on completely denormalized, numeric-only tables (i.e. no joins, no strings) is truly impressive.
Vizatra supports local hash joins between facts and small dimensional tables, obviating the need for star-schemas in many cases.
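As an illustration of the general technique (not Vizatra's code), a local hash join builds an in-memory lookup from the small dimension table and then streams the numeric fact rows past it, so the database only ever sees the join-free query. The function and sample rows below are invented for the example:

```python
def hash_join(fact_rows, dim_rows, key, dim_columns):
    """Enrich each fact row with columns from a small dimension table."""
    # Build phase: index the small dimension table by its key.
    lookup = {row[key]: row for row in dim_rows}
    # Probe phase: stream over the (larger) fact result set.
    joined = []
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is None:
            continue  # inner-join semantics: skip facts with no matching row
        joined.append({**fact, **{c: dim[c] for c in dim_columns}})
    return joined

facts = [
    {"campaign_id": 1, "displays": 100, "clicks": 7},
    {"campaign_id": 2, "displays": 50, "clicks": 1},
    {"campaign_id": 9, "displays": 5, "clicks": 0},
]
campaigns = [
    {"campaign_id": 1, "client_name": "acme"},
    {"campaign_id": 2, "client_name": "globex"},
]
result = hash_join(facts, campaigns, "campaign_id", ["client_name"])
print(result)
```

This only pays off when the dimension table is small enough to hold in memory on the application side.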
Vertica: has no cache
Vertica has no query cache and can’t pre-cache commonly queried data.
Vizatra currently caches result sets in a shared on-heap LRU cache. Distributed caching is under study.
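The architecture slide lists a Guava Cache for this. As a language-neutral illustration of the LRU idea only, here is a minimal Python stand-in. The class name and the choice to key on query text plus bound parameters are assumptions for this sketch:

```python
from collections import OrderedDict

class ResultSetCache:
    """A tiny least-recently-used cache for query result sets."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, result_set):
        self._entries[key] = result_set
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used

cache = ResultSetCache(max_entries=2)
cache.put(("q1", ("2015-07-01",)), [("a", 1)])
cache.put(("q2", ()), [("b", 2)])
cache.get(("q1", ("2015-07-01",)))  # q1 is now the most recently used
cache.put(("q3", ()), [("c", 3)])   # evicts q2
```

Because deconstructed queries are generated from a fixed template, identical dashboard requests produce identical keys and can be served without touching the database.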
Vertica Real World Query Optimizing
Some basic optimization efforts.
Vertica Performance: Basic Optimization
By default Vertica projections are segmented by a hash of all columns and unsorted, meaning that for pretty much any operation (join, filter, group by) queries will be sub-optimal.
Vertica Performance: Optimizing for MERGEJOIN
If we sort both tables by the join key, we can take advantage of streaming joins, and Vertica is very fast at streaming :)
Query against default projection:
select
  cast(date_trunc('day', f.time_id) as DATE) as day,
  c.client_name,
  sum(f.displays),
  sum(f.clicks)
from
  fact_default_projection f
  join dim_campaign c on c.campaign_id = f.campaign_id
group by
  1, 2
Run time: ~3.5s
Query against join key ordered projection:
select
  CAST(date_trunc('day', f.time_id) as DATE) as day,
  c.client_name,
  sum(f.displays),
  sum(f.clicks)
from
  facts_ordered_by_campaign_id f
  join dim_campaign c on c.campaign_id = f.campaign_id
group by
  1, 2
Run time: ~1.5s
Merge Join is more than twice as fast!
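For readers unfamiliar with the technique, here is a textbook merge join in Python. It shows why sorting both inputs by the join key helps: the join becomes a single forward pass over each side with no hash table to build. It is a generic sketch with invented sample data, not a description of Vertica's internals:

```python
def merge_join(left, right, key):
    """Inner-join two lists of dicts that are both already sorted by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    out.append({**left[i], **r})
                i += 1
            j = j_end
    return out

facts = [
    {"campaign_id": 1, "clicks": 10},
    {"campaign_id": 1, "clicks": 20},
    {"campaign_id": 2, "clicks": 5},
    {"campaign_id": 4, "clicks": 1},
]
campaigns = [
    {"campaign_id": 1, "client_name": "acme"},
    {"campaign_id": 2, "client_name": "globex"},
    {"campaign_id": 3, "client_name": "initech"},
]
joined = merge_join(facts, campaigns, "campaign_id")
print([row["client_name"] for row in joined])  # ['acme', 'acme', 'globex']
```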
Vertica Performance: Denormalize and PIPELINE it all!
If we can get rid of all joins and strings, and get group by pipeline operations, Vertica is wicked fast.
Let’s query some numerics without joins:
select
  campaign_id,
  sum(displays),
  sum(clicks)
from
  fact_order_by_campaign_time
group by
  1
Run time: ~150ms!
Let’s add back the date:
select
  campaign_id,
  cast(date_trunc('day', time_id) as DATE) as day,
  sum(displays),
  sum(clicks)
from
  fact_order_by_campaign_time
group by
  1, 2
Run time: ~1000ms?!?
date_trunc breaks group by pipeline operations!
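A pipelined group by relies on rows arriving already sorted by the grouping columns, so each group can be emitted as soon as the key changes and only one running total is held at a time. The Python sketch below illustrates that general idea with made-up rows. It does not model Vertica itself. When the grouping key is a computed expression rather than a stored, sorted column, an engine typically cannot rely on the sort order and has to fall back to hashing every group.

```python
from itertools import groupby

def pipelined_sum(rows, key, metric):
    """Yield (key, sum) pairs from rows that are already sorted by `key`.

    Only one running total is held at a time; a group is emitted as soon
    as the key changes.
    """
    for k, group in groupby(rows, key=lambda row: row[key]):
        yield k, sum(row[metric] for row in group)

sorted_rows = [
    {"campaign_id": 1, "displays": 10},
    {"campaign_id": 1, "displays": 5},
    {"campaign_id": 2, "displays": 7},
]
totals = list(pipelined_sum(sorted_rows, "campaign_id", "displays"))
print(totals)  # [(1, 15), (2, 7)]
```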
Vertica Performance: Materialize deterministic non-deterministic functions
Create a projection with calculations:
create projection fact_order_by_day_campaign (
  campaign_id,
  displays,
  clicks,
  time_day
) AS SELECT
  campaign_id,
  displays,
  clicks,
  cast(date_trunc('day', time_id) as DATE) as time_day
FROM
  fact
ORDER BY
  cast(date_trunc('day', time_id) as DATE),
  campaign_id
segmented by HASH(campaign_id, displays, clicks) ALL NODES KSAFE 1
Let’s re-run that query:
select
  time_day,
  campaign_id,
  sum(displays),
  sum(clicks)
from
  fact_order_by_day_campaign
group by
  1, 2
Run time: ~150ms! Yes!
Vizatra Design
A quick architectural review
Vizatra Design: Finatra Backbone
[Architecture diagram: Render (D3 + Angular), Finatra, RDBMS; request labels: meta-data request, DataSet request; components: AppController, DataSetService, VizSQL Parser, VizSQLDataSource, Guava Cache]
• Full Angular Front End
• Pure D3 Charting
• Limited JS dependencies
• Finatra for RESTfulish end points
• Scala parsing-combinators for VizSQL parsing
• HOCON config
• Deploys as self-executable JAR
• Metrics sent to Graphite
• First Criteo app in Mesos!
Vizatra Design: Why Finatra?
• Simple and Lightweight
• Doesn’t impose itself on the developer
• Finagle Ecosystem
• Pretty complete and concise documentation
A Vizatra Demo
Prepare for the demo effect.
Thanks to Sean Lahman for the dataset: http://www.seanlahman.com/baseball-archive/statistics/
The Vizatra Roadmap
• More time series charts (dual axis, stacked area, etc.)
• Drill down in navigation matrix
• Easier board configuration
• Compile-time query validation
• Non-time series analytics
• Multiple visualizations in the same view
• Composable views
• Saveable views
• Best effort query analysis (i.e. no need for annotations)
And One More Thing
We will be open sourcing Vizatra in Q3/Q4.
Thank You
Questions?