snappydata ad analytics use case -- bdam meetup sept 14th

TRANSCRIPT

Page 1: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Speed of thought Analytics for Ad impressions using Spark 2.0 and Snappydata

www.snappydata.io

Jags Ramnarayan, CTO & Co-founder @ SnappyData

Page 2: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Our Pedigree

SnappyData

SpinOut

● New Spark-based open source project started by Pivotal GemFire founders+engineers

● Decades of in-memory data management experience

● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database

Funded by Pivotal, GE, GTD Capital

Page 3: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Mixed workloads are common – Why is Lambda the answer?

Page 4: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Lambda Architecture is Complex

• Complexity: learn and master multiple products, data models, disparate APIs, and configs

• Slower

• Wasted resources

Source: Highscalability.com

Page 5: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Can we simplify & optimize?

Perhaps a single clustered DB that can manage stream state, transactional data, and run OLAP queries?

Page 6: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Ad Analytics using Lambda

Page 7: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Ad Impression Analytics: Ad Network Architecture – Analyze log impressions in real time

Page 8: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Ad Impression Analytics

Ref - https://chimpler.wordpress.com/2014/07/01/implementing-a-real-time-data-pipeline-with-spark-streaming/

Page 9: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Bottlenecks in the write path

- Stream micro batches in parallel from Kafka to each Spark executor
- Filter and collect events for 1 minute; reduce to 1 event per Publisher, Geo every minute
- Execute GROUP BY … expensive Spark shuffle …
- Shuffle again in the DB cluster … data format changes … serialization costs

val input: DataFrame = sqlContext.read
  .options(kafkaOptions)
  .format("kafkaSource")
  .stream()

val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window(col("event-time"), "1 minute"), col("Publisher"), col("geo"))
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))

val query = result.write
  .format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")
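The snippet above targets the Spark 2.0 preview Structured Streaming API (.stream() / .startStream()). For reference, a minimal sketch of the same pipeline against the finalized API, assuming Spark 2.1+ with the spark-sql-kafka package on the classpath; the Kafka endpoint, event schema, and sink path are assumptions, streaming aggregations in append mode need a watermark, and countDistinct must become approx_count_distinct:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("AdAnalytics").getOrCreate()
import spark.implicits._

// Hypothetical schema for the decoded ad impression events
val impressionSchema = new StructType()
  .add("event_time", TimestampType).add("publisher", StringType)
  .add("geo", StringType).add("bid", DoubleType).add("cookie", StringType)

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // hypothetical endpoint
  .option("subscribe", "adnetwork-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json($"json", impressionSchema).as("i"))
  .selectExpr("i.*")

// Same aggregation as above; the watermark bounds state so append mode can emit results
val result = input
  .where($"geo" =!= "UnknownGeo")
  .withWatermark("event_time", "2 minutes")
  .groupBy(window($"event_time", "1 minute"), $"publisher", $"geo")
  .agg(avg("bid"), count("geo"), approx_count_distinct("cookie"))

// Write each finalized window out (a Parquet sink here; the slide's pipeline used Cassandra)
val query = result.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "dest-path")
  .option("checkpointLocation", "/tmp/ad-analytics-checkpoint")
  .start()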

Page 10: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Bottlenecks in the Write Path

• Aggregations – GroupBy, MapReduce
• Joins with other streams, reference data
• Shuffle costs (copying, serialization)
• Excessive copying in Java-based scale-out stores

Page 11: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Impedance mismatch with KV stores

We want to run interactive “scan”-intensive queries:

- Find total uniques for Ads grouped on geography

- Impression trends for Advertisers (group-by query)

Two alternatives: row-stores vs. column stores

Page 12: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Goal – Localize processing

Row stores: fast key-based lookups, but too slow for aggregations and scan-based interactive queries.

Column stores: fast scans and aggregations, but updates and random writes are very difficult, and they consume too much memory.

Page 13: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Why columnar storage in-memory?

Source: MonetDB

Page 14: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

How did Spark 2.0 do?
- Spark 2.0, MacBook Pro, 4-core 2.8 GHz Intel i7, enough RAM
- Data set: 105 million records

Query: select AVG(bid) from AdImpressions
- Parquet files in the OS buffer cache: ~3 seconds
- Managed in Spark memory: ~600 milliseconds
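A rough sketch of how one might reproduce this comparison on a laptop (file path and view names are hypothetical; the ~3 s vs. ~600 ms numbers are the slide's, not something this snippet guarantees):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("avg-bid-bench").master("local[4]").getOrCreate()

def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val out = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.0f ms")
  out
}

// 1) Query straight off Parquet files (the OS buffer cache is warm after the first run)
spark.read.parquet("/data/adimpressions.parquet").createOrReplaceTempView("AdImpressions")
timed("parquet") { spark.sql("select AVG(bid) from AdImpressions").collect() }

// 2) Same query over data managed in Spark's in-memory columnar cache
val cached = spark.table("AdImpressions").cache()
cached.count()                                   // force materialization into memory
cached.createOrReplaceTempView("AdImpressionsCached")
timed("in-memory") { spark.sql("select AVG(bid) from AdImpressionsCached").collect() }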

Page 15: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Spark 2.0 query plan – what is different?

Scan over 105 million integers.

Shuffle results from each partition to compute the Avg across all partitions – cheap in this case, with only 11 partitions.

select AVG(ArrDelay) from airline
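As a quick check, the physical plan shows whether whole-stage code generation (next slide) applied to this query; a minimal sketch, assuming a SparkSession named spark and a registered airline table:

spark.sql("select AVG(ArrDelay) from airline").explain()
// In Spark 2.x explain() output, operators fused by whole-stage codegen are prefixed with '*'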

Page 16: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Whole Stage Code Generation

Typical query engine design:
- each operator is implemented using functions
- and functions imply chasing pointers … expensive

Code generation in Spark:
-- remove virtual function calls
-- arrays and variables instead of objects
-- capitalize on modern CPU caches

(Diagram: operator pipeline – Scan → Filter → Project → Aggregate)

"How to remove complexity? Add a layer. How to improve perf? Remove a layer."

Filter() {
  getNextRow() {
    row = Scan.getNextRow()            // get a row from scan() (the child operator)
    if (filter condition is true) return row
  }
}

Scan() {
  getNextRow() {
    return next row from fileInputStream
  }
}
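By contrast, whole-stage code generation collapses the operator chain into one tight loop over primitive values. This is a hand-written Scala sketch of the kind of loop the generated code amounts to, with hypothetical bids/geos arrays standing in for the scanned columns:

val bids: Array[Double] = Array(0.0001, 0.0005, 0.0002)   // hypothetical column batch
val geos: Array[String] = Array("NY", "UnknownGeo", "VT")  // hypothetical column batch

var sum = 0.0
var count = 0L
var i = 0
while (i < bids.length) {            // Scan: tight loop, no per-row virtual calls
  if (geos(i) != "UnknownGeo") {     // Filter: predicate inlined into the loop
    sum += bids(i)                   // Aggregate: partial sum/count for avg(bid)
    count += 1
  }
  i += 1
}
val avgBid = if (count > 0) sum / count else Double.NaN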

Page 17: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Good enough? Hitting the CPU Wall?

select count(*), advertiser
from history t1, current t2, adimpress t3
where <t1 join t2 join t3>
group by geo
order by count desc limit 8

Distributed Joins can be very expensive

(Chart: response time in seconds vs. concurrency – 1 and 10 concurrent queries.)

Page 18: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Challenges with In-memory Analytics

- DRAM is still relatively expensive for the deluge of data
- Analytics in the cloud requires fluid data movement – how do you move large volumes to/from clouds?

Page 19: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Use statistical techniques to shrink data?

• Most apps are happy to trade off 1% accuracy for a 200x speedup
• You can usually get a 99.9% accurate answer by looking at only a tiny fraction of the data
• You can often make perfectly accurate decisions without perfectly accurate answers
  • A/B testing, visualization, ...
• The data itself is usually noisy
  • Processing the entire data set doesn't necessarily mean exact answers
  • Inference is probabilistic anyway

Page 20: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

SnappyData

A Hybrid Open source system for Transactions, Analytics, Streaming

(https://github.com/SnappyDataInc/snappydata)

Page 21: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

SnappyData – In-memory Hybrid DB with Spark

A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics

- Spark: batch design, high throughput; rapidly maturing
- In-memory store (GemFire heritage): real-time design – low latency, HA, concurrency; matured over 13 years

Vision: drastically reduce the cost and complexity of modern big data.

Page 22: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

SnappyData fuses speed + serving layer

(Architecture diagram: Kafka streams are processed and stored in SnappyData servers – each a Spark executor + store – holding current operational data, synopses data, and in-memory compute state; reference data is lazily written to and fetched on demand from an RDB and HDFS (history data); batch compute and interactive analytic queries run over external data (S3, RDB, XML, …) through the Spark API ++ – Java, Scala, Python, R, REST.)

Page 23: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Realizing ‘speed-of-thought’ Analytics

(Architecture diagram: Kafka queue partitions feed stream processing inside each SnappyData server – Spark executor + store; a hybrid store keeps rows, columnar data, indexes, and synopses in memory, with overflow and local persistence; random writes land in rows, batch compute works on columns; reference data lives in an RDB, history data in HDFS/MPP DB; Spark or SQL programs and interactive analytic queries go through the Spark API ++ – Java, Scala, Python, R, REST.)

Page 24: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Fast, Fewer Resources, Flexible

• Fast
  - Stream and ingested data colocated on a shared key
  - Tables colocated on a shared key
  - Far less copying and serialization
  - Improvements to vectorization (20X faster than Spark)
• Use less memory and CPU
  - Maintain only "hot/active" data in RAM
  - Summarize all data using synopses
• Flexible
  - Spark. Enough said.

Page 25: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Features

- Deeply integrated database for Spark
  - 100% compatible with Spark
  - Extensions for transactions (updates), SQL stream processing
  - Extensions for high availability
  - Approximate query processing for interactive OLAP

- OLTP+OLAP store
  - Replicated and partitioned tables
  - Tables can be row- or column-oriented (in-memory & on-disk)
  - SQL extensions for compatibility with the SQL standard – create table, view, indexes, constraints, etc.

Page 26: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

TPC-H: 10X-20X faster than Spark 2.0

Page 27: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Interactive-Speed Analytic Queries – Exact or Approximate

Select avg(Bid), Advertiser from T1 group by Advertiser
Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1

(Chart: speed/accuracy trade-off – error (%) vs. execution time; interactive queries answer in about 2 secs with about 1% error, versus roughly 100 secs, and up to 30 mins, to execute on the entire dataset.)

Page 28: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Query execution with accuracy guarantee

PARSE QUERY → Can the query be executed on the cache?
- Yes (recent time window, computable from samples, within error constraints): in-memory execution with error bar → response
- No (point query on history, outlier query, very complex query): execute in parallel on the base table → response

Page 29: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Synopses Data Engine Features
• Support for uniform sampling
• Support for stratified sampling
  - Solutions exist for stored data (BlinkDB)
  - SnappyData works for infinite streams of data too
• Support for exponentially decaying windows over time
• Support for synopses – Top-K queries, heavy hitters, outliers, ...
• [future] Support for joins
• Workload mining (http://CliffGuard.org)

Page 30: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Uniform (Random) Sampling

Original Table
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Uniform Sample
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  1/3
5   adv20       NY   0.0001  1/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Page 31: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Uniform (Random) Sampling

Original Table
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Uniform Sample (larger)
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  2/3
5   adv20       NY   0.0001  2/3
1   adv10       NY   0.0001  2/3
2   adv10       VT   0.0005  2/3

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'

Page 32: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Stratified Sampling

Original Table
ID  Advertiser  Geo  Bid
1   adv10       NY   0.0001
2   adv10       VT   0.0005
3   adv20       NY   0.0002
4   adv10       NY   0.0003
5   adv20       NY   0.0001
6   adv30       VT   0.0001

Stratified Sample on Geo
ID  Advertiser  Geo  Bid     Sampling Rate
3   adv20       NY   0.0002  1/4
2   adv10       VT   0.0005  1/2

SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
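To make the contrast concrete, here is a minimal Spark sketch of uniform vs. stratified sampling on the geo column (table name from the slides; session setup and per-stratum fractions are hypothetical, and SnappyData's own sample-table machinery is separate and not shown):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("sampling-sketch").getOrCreate()
import spark.implicits._

val impressions = spark.table("AdImpressions")   // table name from the slides

// Uniform sampling: rare strata such as geo = 'VT' may get few or no rows
val uniform = impressions.sample(withReplacement = false, fraction = 0.33, seed = 42)
println(uniform.where($"geo" === "VT").count())  // may well be 0 for a rare stratum

// Stratified sampling on geo: keep a larger fraction of under-represented strata
val fractions = Map("NY" -> 0.25, "VT" -> 0.5)   // hypothetical per-stratum rates
val stratified = impressions.stat.sampleBy("geo", fractions, seed = 42)

// AVG needs no rescaling; for SUM/COUNT, weight each sampled row by 1/rate
stratified.where($"geo" === "VT").agg(avg($"bid")).show()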

Page 33: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Sketching techniques
● Sampling is not effective for outlier detection (MAX/MIN etc.)
● Other probabilistic structures: CMS (count-min sketch), heavy hitters, etc.
● SnappyData implements Hokusai – capturing item frequencies in time series
● The design permits TopK queries over arbitrary time intervals (Top-100 popular URLs)

SELECT pageURL, count(*) frequency FROM Table
WHERE .... GROUP BY ....
ORDER BY frequency DESC
LIMIT 100
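For reference, Spark itself exposes a count-min sketch through DataFrame.stat; this minimal sketch (table and column names hypothetical, and distinct from SnappyData's built-in Hokusai/TopK synopses) shows the idea of trading exactness for a tiny, fixed-size summary:

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.CountMinSketch

val spark = SparkSession.builder.appName("cms-sketch").getOrCreate()

// Build a count-min sketch over pageURL: eps bounds the relative overcount,
// confidence is the probability that the bound holds
val cms: CountMinSketch = spark.table("PageViews")
  .stat.countMinSketch("pageURL", eps = 0.001, confidence = 0.99, seed = 42)

// Approximate frequency of a single item, answered from the sketch alone
val approx: Long = cms.estimateCount("http://example.com/landing")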

Page 34: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Synopses Data Engine Demo

(Demo setup diagram: a Zeppelin Server with a Zeppelin Spark interpreter (driver) connected to multiple Spark executor JVMs, each holding a row cache plus compressed columnar storage.)

Page 35: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Unified OLAP/OLTP streaming w/ Spark

● Far fewer resources: a TB problem becomes a GB problem
  ○ CPU contention drops
● Far less complex
  ○ a single cluster for stream ingestion, continuous queries, interactive queries and machine learning
● Much faster
  ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive

Page 36: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

www.snappydata.io

SnappyData is Open Source
● Ad Analytics example/benchmark: https://github.com/SnappyDataInc/snappy-poc
● https://github.com/SnappyDataInc/snappydata
● Learn more: www.snappydata.io/blog
● Connect:
  ○ twitter: www.twitter.com/snappydata
  ○ facebook: www.facebook.com/snappydata
  ○ slack: http://snappydata-slackin.herokuapp.com

Page 37: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

EXTRAS

Page 38: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Use Case Patterns

1. Operational analytics DB
   - Caching for analytics over disparate sources
   - Federate queries between samples and the backend

2. Stream analytics for Spark
   - Process streams, transform, real-time scoring, store, query

3. In-memory transactional store
   - Highly concurrent apps, SQL cache, OLTP + OLAP

Page 39: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

How SnappyData Extends Spark

Page 40: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Snappy Spark Cluster Deployment Topologies

Unified Cluster
• Snappy store and Spark executor share the JVM memory
• Reference-based access – zero copy

Split Cluster
• SnappyStore is isolated but uses the same column format as Spark for high throughput

Page 41: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Simple API – Spark Compatible

● Access: tables as DataFrames; the catalog is automatically recovered
● Store: RDD[T]/DataFrame can be stored in SnappyData tables
● Access from remote SQL clients
● Additional API for updates, inserts, deletes

// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)

// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… now use any of the DataFrame APIs …>

Page 42: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Extends Spark

CREATE [Temporary] TABLE [IF NOT EXISTS] table_name (
    <column definition>
  ) USING 'JDBC | ROW | COLUMN'
  OPTIONS (
    COLOCATE_WITH 'table_name',                  // Default: none
    PARTITION_BY 'PRIMARY KEY | column name',    // Default: replicated table
    REDUNDANCY '1',                              // Manage HA
    PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS",
                                                 // Empty string maps to the default disk store
    OFFHEAP "true | false",
    EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT",
    …
  ) [AS select_statement];
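As a usage illustration, a hypothetical concrete instance of the template above, issued through the Snappy context shown on the API slide (column layout borrowed from the sampling slides; option values are illustrative only):

context.sql("""
  CREATE TABLE IF NOT EXISTS adImpressions (
    id BIGINT, advertiser VARCHAR(64), geo VARCHAR(32), bid DOUBLE
  ) USING 'COLUMN'
  OPTIONS (PARTITION_BY 'geo', REDUNDANCY '1')
""")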

Page 43: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Simple to Ingest Streams using SQL

Pipeline: consume from stream → transform raw data → continuous analytics → ingest into the in-memory store → overflow table to HDFS

create stream table AdImpressionLog (<columns>)
  using directkafka_stream options (
    <socket endpoints>
    "topics 'adnetwork-topic'",
    "rowConverter 'AdImpressionLogAvroDecoder'")

// Register a continuous query and write each result batch to the column table
streamingContext.registerCQ(
    "select publisher, geo, avg(bid) as avg_bid, count(*) imps,
            count(distinct(cookie)) uniques
     from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
     where geo != 'unknown' group by publisher, geo")
  .foreachDataFrame(df => {
    df.write.format("column").mode(SaveMode.Append)
      .saveAsTable("adImpressions")
  })

Page 44: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Unified Cluster Architecture

Page 45: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

How do we extend Spark for Real Time?

• Spark executors are long-running; a driver failure doesn't shut down the executors
• Driver HA – drivers run "managed" with a standby secondary
• Data HA – consensus-based clustering integrated for eager replication

Page 46: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

How do we extend Spark for Real Time?

• Bypass the scheduler for low-latency SQL
• Deep integration with Spark Catalyst (SQL) – colocation optimizations, index use, etc.
• Full SQL support – persistent catalog, transactions, DML

Page 47: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

AdImpression Demo – Spark, SQL code walkthrough, interactive SQL

Page 48: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Concurrent Ingest + Query Performance

• AWS: 4 c4.2xlarge instances – 8 cores, 15 GB memory each

• Each node ingests the stream from Kafka in parallel

• Parallel batch writes to the store (32 partitions)

• Only a few cores are used for stream writes, as most of the CPU is reserved for OLAP queries

(Chart: per-second stream ingestion throughput on 4 nodes, with a cap on CPU to leave room for queries – Spark-Cassandra vs. Spark-InMemoryDB vs. SnappyData, on a 0–700,000 events/sec scale.)

https://github.com/SnappyDataInc/snappy-poc

2X – 45X faster (vs Cassandra, Memsql)

Page 49: SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th

Concurrent Ingest + Query Performance

Sample "scan"-oriented OLAP query (Spark SQL), executed while ingesting data:

select count(*) AS adCount, geo from adImpressions group by geo order by adCount desc limit 20;

Q1 response time (millis) at 30M / 60M / 90M rows:
  Spark-Cassandra:    20346 / 65061 / 93960
  Spark-InMemoryDB:    3649 /  5801 /  7295
  SnappyData:          1056 /  1571 /  2144

https://github.com/SnappyDataInc/snappy-poc

2X – 45X faster