bdm39: hp vertica bi: sub-second big data analytics your users and developers can truly appreciate -...

HP VERTICA BI SUB-SECOND BIG DATA ANALYTICS YOUR USERS AND DEVELOPERS CAN TRULY APPRECIATE PRESENTED BY MINA NAGUIB BIG DATA MONTRÉAL AUGUST 2015


Post on 17-Aug-2015


HP VERTICA BI: SUB-SECOND BIG DATA ANALYTICS YOUR USERS AND DEVELOPERS CAN TRULY APPRECIATE

PRESENTED BY MINA NAGUIB
BIG DATA MONTRÉAL, AUGUST 2015

ABOUT ME

Director, Platform Engineering @AdGear

Background:
• Software hacker
• Network enthusiast
• Web designer, SQL weaver, kernel debugger, PM, RE, SRE, QA, ...

What I do:
• Hire great people at AdGear
• Offer technical leadership
• Get out of their way
• Observe, optimize, rinse, repeat

ABOUT ADGEAR

AdGear is a digital advertising technology company, providing platforms, ad technology and services to publishers, advertisers, media agencies and ad tech providers.

AdGear delivers a full-stack advertising platform that includes: Demand-Side Platform, Supply-Side Platform, 1st and 3rd Party Ad Server, Attribution and Analytics, and multiple retargeting offerings.

ABOUT ADGEAR

• Founded in 2008
• 40 employees
• 2 offices (514, 416)
• ~10 billion impressions served per month
• 0.5 trillion bid requests per month

ABOUT ADGEAR

We power these fine purveyors of your favourite internet services:

ABOUT ADGEAR

And many others, sometimes in the background

ADGEAR: DATA

Internet advertising generates lots of data, the majority of which is transactional data that must be accurately accounted for.

If you can't account for it, it didn't happen. The data generated is often more important than the occurrence of the event itself.

ADGEAR: SOME NUMBERS

September 2008 First event served in production

2008 2 events / second

2010 250 events / second

2012 2,500 events / second

2014 5,500 events / second

ADGEAR: SOME NUMBERS

September 2008 First event served in production

2008 2 events / second

2010 250 events / second

RTB changed the game:

2012 80,000 events / second

2014 200,000 events / second
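As a rough sanity check on these rates, a sustained per-second figure can be converted into a monthly volume. The sketch below assumes a constant rate and a 30-day month (both simplifications, not figures from the talk), and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def events_per_month(events_per_second, days=30):
    """Convert a sustained per-second rate into a monthly event count.

    Assumes a constant rate and, by default, a 30-day month.
    """
    return events_per_second * SECONDS_PER_DAY * days


if __name__ == "__main__":
    # Rates taken from the slide; the monthly totals are derived, not quoted.
    for year, rate in [(2008, 2), (2010, 250), (2012, 80_000), (2014, 200_000)]:
        print(f"{year}: {rate:,} events/s is about {events_per_month(rate):,} events/month")
```

Under those assumptions, 200,000 events per second works out to roughly 518 billion events in a month, which is the same order of magnitude as the 0.5 trillion bid requests per month quoted earlier.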

ADGEAR: DATA

From Day 1:

• Offer customers a self-serve reporting section in the UI to report on what happened
• Make it responsive, pivotable, discoverable, useful and insightful
• We're competing against dinosaurs with a closed-day banking mentality, so go for realtime and semi-realtime
• Safe and correct: better to say N/A than to offer a partial metric

ADGEAR: DATA

The data architecture plan, circa 2008

Step 1: Log the event locally on the server it occurs on

Step 2: Harvest the events

Step 3: ????

Step 4: Profit!

ADGEAR: DATA

The data architecture plan, circa 2008

Step 1: Log the event locally on the server it occurs on

Step 2: Harvest the events

Step 3: ???? (How hard can this really be?)

Step 4: Profit!

ADGEAR: DATA

The elusive Step 3

2008 2009 2010

Raw event management: Home-grown "Harvester" library
Raw event warehousing: Single unix filesystem, .json.gz files, .sqlite files
Raw event analysis + aggregation: "Harvester" library streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table
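To make the shape of this first pipeline concrete, here is a minimal sketch of the general idea: log events as gzip-compressed JSON lines, stream them back, and roll them up into key-value aggregates. It is not AdGear's "Harvester" library, whose API the slides do not show; the event fields (`ts`, `campaign_id`, `type`) and all function names are hypothetical.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def write_events(path, events):
    """Step 1: log events locally, one JSON object per line, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")


def stream_events(paths):
    """Step 2: harvest events lazily from a list of .json.gz files."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def aggregate(events):
    """Step 3: roll raw events up into key-value aggregates.

    The key is (hour, campaign_id, event type); the value is a count.
    """
    totals = Counter()
    for event in events:
        hour = event["ts"][:13]  # "2015-08-05T01:02:03" -> "2015-08-05T01"
        totals[(hour, event["campaign_id"], event["type"])] += 1
    return totals


def demo():
    """Write a few made-up events to a temp file and aggregate them."""
    sample = [
        {"ts": "2015-08-05T01:02:03", "campaign_id": 55, "type": "impression"},
        {"ts": "2015-08-05T01:10:00", "campaign_id": 55, "type": "impression"},
        {"ts": "2015-08-05T01:10:05", "campaign_id": 55, "type": "click"},
        {"ts": "2015-08-05T02:00:00", "campaign_id": 56, "type": "impression"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.json.gz")
        write_events(path, sample)
        return aggregate(stream_events([path]))


if __name__ == "__main__":
    for key, count in sorted(demo().items()):
        print(key, count)
```

Because the reader is a generator, events are processed one at a time rather than loaded into memory, which is one plausible reading of the "streaming abstraction" named on the slide.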

ADGEAR: DATA

The elusive Step 3

2009 2010 2011 2012

Raw event management: Home-grown "Harvester" library
Raw event warehousing: Single unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis + aggregation: "Harvester" library streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table

ADGEAR: DATA

The elusive Step 3

2009 2010 2011 2012

Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis + aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table

ADGEAR: DATA

The elusive Step 3

2010 2011 2012 2013

Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis + aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: Dedicated MongoDB server, hourly documents
Reporting: Dedicated reporting service abstracting away MongoDB

ADGEAR: DATA

The elusive Step 3

2011 2012 2013 2014

Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis + aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: Dedicated PostgreSQL reporting DB, star schema
Reporting: Dedicated reporting service abstracting away PG DB

ADGEAR: DATA

The elusive Step 3

2011 2012 2013 2014 2015

Raw event management: Home-grown push mechanism
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis + aggregation: Hadoop M+R, Pig, Hive
Aggregate metrics warehousing: Dedicated PostgreSQL reporting DB, star schema
Reporting: Dedicated reporting service abstracting away PG DB

ADGEAR: DATA

The elusive Step 3

2012 2013 2014 2015

Raw event management: Home-grown push mechanism
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis + aggregation: Hadoop M+R, Pig, Hive
Aggregate metrics warehousing: Vertica
Reporting: Dedicated reporting service abstracting away Vertica DB

ADGEAR: DATA

The elusive Step 3

2015

Raw event management: Home-grown push mechanism, Kafka
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis + aggregation: Hadoop, HP Vertica, Hive
Aggregate metrics warehousing: HP Vertica
Reporting: Dedicated reporting service abstracting away Vertica DB

ADGEAR: DATA

HP Vertica = The "Secret Sauce" *

* Actual unsolicited description used by myself and other Vertica customers

From a dev/ops perspective, Vertica is:

• A columnar database
• Offers a familiar DB/Schema/Table/Row/Column paradigm
• Distributed + horizontally scalable
• Easily accessible from the CLI and many programming languages
• Extremely fast
• Solid SQL support. Not 100% ANSI SQL-99 compliant, but more than enough for our use cases
• Stable, predictable, easy to administer
• Well documented
• Enterprise-ready, in production at many large companies


From a product perspective:

Extremely fast

At AdGear

Fact Table N

Hour           Dimension1  Dimension2  Dimension3  Dimension...N  Metric1  Metric2  Metric...N
2015-08-05-01  1           55          105         9              1        0        0
2015-08-05-01  1           56          106         9              3551     6        9
2015-08-05-01  1           56          107         9              2382     6        66
2015-08-05-01  2           901         107         33             23       4        0

Growth via append-only row insertion
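The append-only pattern can be sketched as follows. This uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not Vertica, and the table name, the reduced column set and the late-arriving batch are assumptions made for illustration. The first three rows reuse values from the slide above.

```python
import sqlite3

# In-memory SQLite database used only as a runnable stand-in for Vertica.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_delivery (
        hour        TEXT    NOT NULL,
        dimension1  INTEGER NOT NULL,
        dimension2  INTEGER NOT NULL,
        dimension3  INTEGER NOT NULL,
        metric1     INTEGER NOT NULL,
        metric2     INTEGER NOT NULL
    )
""")


def append_rows(rows):
    """Grow the fact table by insertion only: no UPDATE, no DELETE."""
    conn.executemany("INSERT INTO fact_delivery VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def hourly_totals():
    """Sum the metrics at read time, so late rows are picked up automatically."""
    return conn.execute("""
        SELECT hour, dimension2, SUM(metric1), SUM(metric2)
        FROM fact_delivery
        GROUP BY hour, dimension2
        ORDER BY hour, dimension2
    """).fetchall()


# First batch: values taken from the slide (subset of columns).
append_rows([
    ("2015-08-05-01", 1, 55, 105, 1, 0),
    ("2015-08-05-01", 1, 56, 106, 3551, 6),
    ("2015-08-05-01", 1, 56, 107, 2382, 6),
])
# Hypothetical late batch for the same hour: appended, never merged in place.
append_rows([("2015-08-05-01", 1, 56, 106, 10, 1)])

if __name__ == "__main__":
    for row in hourly_totals():
        print(row)
```

The design choice illustrated here is that corrections and late data arrive as extra rows, and the totals are computed by the query rather than stored, which keeps writes simple.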

At AdGear

Fact Table 1, Fact Table 2, Fact Table 3

Dimension Table 1, Dimension Table 2, Dimension Table 3, Dimension Table 4, Dimension Table 5

Simple SQL joins
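A fact table joined to its dimension tables might look like the sketch below. Again `sqlite3` stands in for Vertica so the example is runnable; every table name, column name and value is invented for illustration and does not come from AdGear's schema.

```python
import sqlite3

# In-memory SQLite database used only as a runnable stand-in for Vertica.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_site     (site_id     INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_delivery (
        hour TEXT, campaign_id INTEGER, site_id INTEGER,
        impressions INTEGER, clicks INTEGER
    );
    INSERT INTO dim_campaign VALUES (55, 'Campaign A'), (56, 'Campaign B');
    INSERT INTO dim_site     VALUES (105, 'Site X'), (106, 'Site Y');
    INSERT INTO fact_delivery VALUES
        ('2015-08-05-01', 55, 105, 100, 1),
        ('2015-08-05-01', 56, 105, 200, 4),
        ('2015-08-05-01', 56, 106, 300, 6),
        ('2015-08-05-02', 56, 106,  50, 0);
""")


def report():
    """Resolve dimension ids to names and total the metrics per pair."""
    return conn.execute("""
        SELECT c.name, s.name, SUM(f.impressions), SUM(f.clicks)
        FROM fact_delivery AS f
        JOIN dim_campaign  AS c ON c.campaign_id = f.campaign_id
        JOIN dim_site      AS s ON s.site_id     = f.site_id
        GROUP BY c.name, s.name
        ORDER BY c.name, s.name
    """).fetchall()


if __name__ == "__main__":
    for row in report():
        print(row)
```

The fact table stores only integer keys and metrics, while human-readable labels live in the small dimension tables and are attached at query time.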

At AdGear

Let's see it in action

To download and try: https://my.vertica.com/community/

Free, up to 1TB, 3 nodes, no time limit

To learn more: http://www.vertica.com/

Get in touch:

Mina Naguib

[email protected]

http://adgear.com/

Thank you