High Concurrency, Low Latency Analytics Using Spark/Kudu
TRANSCRIPT

High Concurrency, Low Latency Analytics Using Spark/Kudu
Chris George
Who is this guy?
Tech we will talk about:
Kudu
Spark
Spark Job Server
Spark Thrift Server
What was the problem?
Apache Kudu
History of Kudu
Columnar vs other types of storage
What if you could update Parquet/ORC easily?
HDFS vs Kudu vs HBase/Cassandra/xyz
Kudu is purely a storage engine, accessible through an API
To add SQL queries or more advanced SQL-like operations
Impala vs Spark
Kudu Slack Channel
Master and Tablets in Kudu
Range and Hash Partitioning
Number of cores = number of partitions
Partitioning can be on 1+ columns
With a composite primary key, it is important to filter on the key columns in order
(A, B, C), i.e. don't scan for just B if possible;
it will be expensive
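As a sketch of what filtering on the key columns in order looks like at the client level, assuming a hypothetical table `events` whose primary key is (a, b, c) — the table and column names here are illustrative, not from the talk:

```scala
import org.apache.kudu.client.{KuduClient, KuduPredicate}
import org.apache.kudu.client.KuduPredicate.ComparisonOp

val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
val table  = client.openTable("events")
val schema = table.getSchema

// A predicate on the leading key column "a" lets Kudu seek within key order;
// a predicate on "b" alone cannot use the key and forces a far more expensive scan.
val scanner = client.newScannerBuilder(table)
  .addPredicate(KuduPredicate.newComparisonPredicate(
    schema.getColumn("a"), ComparisonOp.EQUAL, "host-01"))
  .build()
```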
Scans on a tablet are single-threaded, but you can run 200+ scans on a tablet concurrently
To find your scale... load up a single tablet with
insertions, updates, deletes... concurrently,
until it no longer meets your performance requirements
Partitioning is extremely important
The Kudu client is Java; there is also a C++ client,
and Python connectors are coming
The Java client loops through tablets, but not concurrently
But you can code the multithreading yourself, or contribute
Predicates on any column
Summary of why Kudu?
Predicates/Projections on any column very
quickly at scale
Spark
Spark Datasource API:
Reads CSV:

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
```

Writes CSV:

```scala
val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
```
But these are often simplified as:

```scala
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
```
I wrote the current version of the Kudu Datasource/Spark
Integration
There are limitations with the Datasource API
Save modes for the Datasource API:
append, overwrite, ignore, error
append = insert
overwrite = truncate + insert
ignore = create if not exists
error = throw an exception if the data exists
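Against a generic datasource (Parquet here; the dataframe and output path are illustrative), the four modes behave as:

```scala
import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).parquet("out")        // append = insert
df.write.mode(SaveMode.Overwrite).parquet("out")     // overwrite = truncate + insert
df.write.mode(SaveMode.Ignore).parquet("out")        // ignore = create if not exists
df.write.mode(SaveMode.ErrorIfExists).parquet("out") // error = throw if data exists
```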
What if I want to update? Nope
What about deletes? Not individually
So how do you support
updates/deletes?
By not using the Datasource API... but I'll talk more about that in a minute
Immutability of dataframes
So why use datasource api?
Because it's smarter than it appears for
reads
Pushdown predicates and
projections
Pushdown predicates:

```scala
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.filter($"id" >= 5).show()
```
The datasource has knowledge of what can be pushed down to the underlying store and what cannot.
Why am I telling you this?
Because if you want things to be fast, you need to know what is not pushed down!
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And

https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala#L159
Did you notice what's missing?
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
"OR"So spark will use it's optimizer to run two separate kudu scans for the OR"IN" is coming very soon, nuanced
performance details
By the way, if you register the dataframe as a temp table in Spark,
"select * from someDF where id >= 5" will also do pushdowns
"select * from someDF where id>=5" will also
do pushdowns
![Page 56: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/56.jpg)
Things like "select * from someDF where lower(name) = 'joe'"
will pull the entire table into memory...
probably a bad thing
![Page 57: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/57.jpg)
Projections will also be pushed down to Kudu, so you're not retrieving the entire row:
df.select("id", "name")
select id, name from someDF
![Page 58: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/58.jpg)
We looked at lots of existing datasources to design Kudu's
![Page 59: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/59.jpg)
How does Kudu do updates/deletes in Spark?
![Page 60: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/60.jpg)
```scala
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
```
![Page 61: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/61.jpg)
```scala
// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select($"id", $"count" + 1)
kuduContext.updateRows(alteredDF, "test_table")
```

http://kudu.apache.org/docs/developing.html
![Page 62: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/62.jpg)
Upserts are handled server side for performance.
Upserts can also be handled through the Datasource API:

```scala
df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu
```
![Page 63: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/63.jpg)
You can also create, check existence of, and delete tables through the API
![Page 64: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/64.jpg)
Additional notes:
The Kudu datasource currently works with Spark 1.x.
The next release will support both 1.x and 2.x.
It's being improved on a regular basis.
![Page 65: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/65.jpg)
The number of partitions on the dataframe is related to how many
tablets/partitions are covered by the filter.
Partition scans are parallel and have locality awareness in Spark.
![Page 66: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/66.jpg)
Be sure to set the Spark locality wait to something small for low latency (3 seconds is the Spark default)
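One way to set this; the exact value is workload-dependent and the one below is illustrative:

```scala
import org.apache.spark.SparkConf

// spark.locality.wait defaults to 3s; a small wait keeps low-latency tasks
// from idling while they wait for a data-local executor slot.
val conf = new SparkConf().set("spark.locality.wait", "100ms")
```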
![Page 67: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/67.jpg)
Spark Job Server (SJS)
Created for low latency jobs on Spark
Persistent contexts:
reduces the runtime of a hello-world type of job from 1 second to 10 ms
![Page 70: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/70.jpg)
A REST-based API to:
run jobs
create contexts
check the status of a job, both async/sync
![Page 71: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/71.jpg)
Creating a context calls spark-submit (in separate-JVM mode).
It uses Akka to communicate between the REST layer and the Spark driver.
![Page 72: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/72.jpg)
To create a persistent context you need:
CPU cores + memory footprint
a name to reference it by
the factory to use for the context, e.g. HiveContextFactory vs SqlContextFactory
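Through the REST API that might look like this; the host, context name, and sizes are hypothetical, and the parameter names follow the Spark Job Server documentation:

```shell
# create a persistent context with 4 cores and 2 GB per node, backed by a HiveContext
curl -X POST "http://sjs-host:8090/contexts/low-latency?num-cpu-cores=4&memory-per-node=2g&context-factory=spark.jobserver.context.HiveContextFactory"
```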
![Page 73: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/73.jpg)
Our average job time is 30 ms when coming through the API for simpler retrievals
Jobs need to implement an interface.
The context will be passed in. DON'T CREATE YOUR OWN SQLCONTEXT!!
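A minimal sketch of such a job, assuming the `spark.jobserver.SparkJob` trait from SJS 0.x; the object name and body are illustrative:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object SumJob extends SparkJob {
  // validate runs first, letting you reject bad input cheaply
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // SJS hands you the shared context -- never create your own SQLContext here
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(1 to 100).sum()
}
```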
![Page 75: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/75.jpg)
Currently only supports Spark 1.x; 2.x is coming soonish
Keeps track of job runtimes in a nice UI, along with additional metrics
You can cache data and it will be available
to later jobs
You can also load objects, and they are available to later jobs via the NamedObject interface
A persistent context can be run in a separate JVM or within SJS
It does have some sharp edges
though...
Due to the JVM classloader, contexts need to be restarted on deploy to pick up new code
Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
have to build from source / no binary for SJS
hive-site.xml:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
```
Spark Thrift Server:
an extended/reused Hive thrift server
I run the following on a persistent context:

```scala
sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)
```
![Page 86: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/86.jpg)
Now I can connect using hive-jdbc
or ODBC (Microsoft or Simba)
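For example, through the Hive JDBC driver; the host, port, user, and table below are illustrative:

```scala
import java.sql.DriverManager

// The Spark Thrift Server speaks the HiveServer2 protocol,
// so the stock hive-jdbc driver can connect to it.
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery(
  "select id, name from kudu_table where id >= 5")
while (rs.next()) println(s"${rs.getLong(1)} ${rs.getString(2)}")
```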
Run a job with joins, or even just a basic dataframe through the Datasource API, and registerTempTable
```scala
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.registerTempTable("kudu_table")
```
You could also potentially cache/persist via Spark and register it that way, assuming joins are expensive
Now you can run queries as if it was a traditional database
Hey, that's great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ billion rows with 200+ columns
SQL queries with 5 predicates that min, max, count some values and group by on 5 columns
No Spark caching
We take this a step further and build complex dataframes, which are made available as registered temp tables
Questions… If we run out of time, send me questions on Slack