High Concurrency, Low Latency Analytics Using Spark/Kudu
TRANSCRIPT

High Concurrency, Low Latency Analytics Using Spark/Kudu
Chris George
Who is this guy?
Tech we will talk about:
Kudu
Spark
Spark Job Server
Spark Thrift Server
What was the problem?
Apache Kudu
History of Kudu
Columnar vs other types of storage
What if you could update Parquet/ORC easily?
HDFS vs Kudu vs HBase/Cassandra/xyz
Kudu is purely a storage engine, accessible through an API
To add SQL queries or more advanced SQL-like operations
Impala vs Spark
Kudu Slack Channel
Master and Tablets in Kudu
Range and Hash Partitioning
Number of cores = number of partitions
Partitioning can be on 1+ columns
With a composite primary key, it is important to filter on the key columns in order
(A, B, C), i.e. don't scan for just B if possible;
it will be expensive
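As a sketch of what filtering on the key columns in order looks like at the client level, assuming a hypothetical table `events` whose primary key is (a, b, c) — the table and column names here are illustrative, not from the talk:

```scala
import org.apache.kudu.client.{KuduClient, KuduPredicate}
import org.apache.kudu.client.KuduPredicate.ComparisonOp

val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
val table  = client.openTable("events")
val schema = table.getSchema

// A predicate on the leading key column "a" lets Kudu seek within key order;
// a predicate on "b" alone cannot use the key and forces a far more expensive scan.
val scanner = client.newScannerBuilder(table)
  .addPredicate(KuduPredicate.newComparisonPredicate(
    schema.getColumn("a"), ComparisonOp.EQUAL, "host-01"))
  .build()
```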
Scans on a tablet are single-threaded, but you can run 200+ scans on a tablet concurrently
To find your scale... load up a single tablet with
insertions, updates, deletes... concurrently,
until it no longer meets your performance requirements
Partitioning is extremely important
The Kudu client is Java; there is also a C++ client,
and Python connectors are coming
The Java client loops through tablets, but not concurrently
But you can code the multithreading yourself, or contribute
Predicates on any column
Summary of why Kudu?
Predicates/Projections on any column very
quickly at scale
Spark
Spark Datasource API:
Reads CSV:

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("cars.csv")
```

Writes CSV:

```scala
val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
```
But these are often simplified as:

```scala
val parquetDataframe = sqlContext.read.parquet("people.parquet")
parquetDataframe.write.parquet("people.parquet")
```
I wrote the current version of the Kudu Datasource/Spark
Integration
There are limitations with the Datasource API
Save modes for the Datasource API:
append, overwrite, ignore, error
append = insert
overwrite = truncate + insert
ignore = create if not exists
error = throw an exception if the data exists
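Against a generic datasource (Parquet here; the dataframe and output path are illustrative), the four modes behave as:

```scala
import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Append).parquet("out")        // append = insert
df.write.mode(SaveMode.Overwrite).parquet("out")     // overwrite = truncate + insert
df.write.mode(SaveMode.Ignore).parquet("out")        // ignore = create if not exists
df.write.mode(SaveMode.ErrorIfExists).parquet("out") // error = throw if data exists
```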
What if I want to update? Nope
What about deletes? Not individually
So how do you support
updates/deletes?
By not using the Datasource API... but I'll talk more about that in a minute
Immutability of dataframes
So why use datasource api?
Because it's smarter than it appears for
reads
Pushdown predicates and
projections
Pushdown predicates:

```scala
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.filter($"id" >= 5).show()
```
The datasource has knowledge of what can be pushed down to the underlying store and what cannot.
Why am I telling you this?
Because if you want things to be fast, you need to know what is not pushed down!
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And

https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/DefaultSource.scala#L159
Did you notice what's missing?
EqualTo
GreaterThan
GreaterThanOrEqual
LessThan
LessThanOrEqual
And
"OR"So spark will use it's optimizer to run two separate kudu scans for the OR"IN" is coming very soon, nuanced
performance details
By the way, if you register the dataframe as a temp table in Spark,
"select * from someDF where id >= 5" will also do pushdowns
"select * from someDF where id>=5" will also
do pushdowns
![Page 56: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/56.jpg)
Things like "select * from someDF where lower(name) = 'joe'"
will pull the entire table into memory...
probably a bad thing
![Page 57: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/57.jpg)
Projections will also be pushed down to Kudu, so you're not retrieving the entire row:
df.select("id", "name")
select id, name from someDF
![Page 58: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/58.jpg)
We looked at lots of existing datasources to design Kudu's
![Page 59: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/59.jpg)
How does Kudu do updates/deletes in Spark?
![Page 60: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/60.jpg)
```scala
// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051")
```
![Page 61: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/61.jpg)
```scala
// Insert data
kuduContext.insertRows(df, "test_table")
// Delete data
kuduContext.deleteRows(filteredDF, "test_table")
// Upsert data
kuduContext.upsertRows(df, "test_table")
// Update data
val alteredDF = df.select($"id", $"count" + 1)
kuduContext.updateRows(alteredDF, "test_table")
```

http://kudu.apache.org/docs/developing.html
![Page 62: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/62.jpg)
Upserts are handled server side for performance.
Upserts can also be handled through the Datasource API:

```scala
df.write
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "test_table"))
  .mode("append")
  .kudu
```
![Page 63: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/63.jpg)
You can also create, check existence of, and delete tables through the API
![Page 64: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/64.jpg)
Additional notes:
The Kudu datasource currently works with Spark 1.x.
The next release will support both 1.x and 2.x.
It's being improved on a regular basis.
![Page 65: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/65.jpg)
The number of partitions on the dataframe is related to how many
tablets/partitions are covered by the filter.
Partition scans are parallel and have locality awareness in Spark.
![Page 66: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/66.jpg)
Be sure to set the Spark locality wait to something small for low latency (3 seconds is the Spark default)
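One way to set this; the exact value is workload-dependent and the one below is illustrative:

```scala
import org.apache.spark.SparkConf

// spark.locality.wait defaults to 3s; a small wait keeps low-latency tasks
// from idling while they wait for a data-local executor slot.
val conf = new SparkConf().set("spark.locality.wait", "100ms")
```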
![Page 67: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/67.jpg)
Spark Job Server (SJS)
Created for low latency jobs on Spark
Persistent contexts:
reduces the runtime of a hello-world type of job from 1 second to 10 ms
![Page 70: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/70.jpg)
A REST-based API to:
run jobs
create contexts
check the status of a job, both async/sync
![Page 71: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/71.jpg)
Creating a context calls spark-submit (in separate-JVM mode).
It uses Akka to communicate between the REST layer and the Spark driver.
![Page 72: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/72.jpg)
To create a persistent context you need:
CPU cores + memory footprint
a name to reference it by
the factory to use for the context, e.g. HiveContextFactory vs SqlContextFactory
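Through the REST API that might look like this; the host, context name, and sizes are hypothetical, and the parameter names follow the Spark Job Server documentation:

```shell
# create a persistent context with 4 cores and 2 GB per node, backed by a HiveContext
curl -X POST "http://sjs-host:8090/contexts/low-latency?num-cpu-cores=4&memory-per-node=2g&context-factory=spark.jobserver.context.HiveContextFactory"
```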
![Page 73: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/73.jpg)
Our average job time is 30 ms when coming through the API for simpler retrievals
Jobs need to implement an interface.
The context will be passed in. DON'T CREATE YOUR OWN SQLCONTEXT!!
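A minimal sketch of such a job, assuming the `spark.jobserver.SparkJob` trait from SJS 0.x; the object name and body are illustrative:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

object SumJob extends SparkJob {
  // validate runs first, letting you reject bad input cheaply
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  // SJS hands you the shared context -- never create your own SQLContext here
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(1 to 100).sum()
}
```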
![Page 75: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/75.jpg)
Currently only supports Spark 1.x; 2.x is coming soonish
Keeps track of job runtimes in a nice UI, along with additional metrics
You can cache data and it will be available
to later jobs
You can also load objects, and they are available to later jobs via the NamedObject interface
A persistent context can be run in a separate JVM or within SJS
It does have some sharp edges
though...
Due to the JVM classloader, contexts need to be restarted on deploy to pick up new code
Some settings:
spark.files.overwrite = true
context-per-jvm = true
spray-can: parsing.max-content-length = 256m
spray-can: idle-timeout = 600 s
spray-can: request-timeout = 540 s
spark.serializer = "org.apache.spark.serializer.KryoSerializer"
filedao vs sqldao backend
have to build from source / no binary for SJS
hive-site.xml:

```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:memory:myDB;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
```
Spark Thrift Server:
an extended/reused Hive thrift server
I run the following on a persistent context:

```scala
sc.getConf.set("spark.sql.hive.thriftServer.singleSession", "true")
sqlContext.setConf("hive.server2.thrift.port", port) // port to run the thrift server on
HiveThriftServer2.startWithContext(sqlContext)
```
![Page 86: High concurrency,Low latency analyticsusing Spark/Kudu](https://reader036.vdocuments.site/reader036/viewer/2022062502/58a994bb1a28abc2518b4b87/html5/thumbnails/86.jpg)
Now I can connect using hive-jdbc
or ODBC (Microsoft or Simba)
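For example, through the Hive JDBC driver; the host, port, user, and table below are illustrative:

```scala
import java.sql.DriverManager

// The Spark Thrift Server speaks the HiveServer2 protocol,
// so the stock hive-jdbc driver can connect to it.
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery(
  "select id, name from kudu_table where id >= 5")
while (rs.next()) println(s"${rs.getLong(1)} ${rs.getString(2)}")
```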
Run a job with joins, or even just a basic dataframe through the Datasource API, and registerTempTable
```scala
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu
df.registerTempTable("kudu_table")
```
You could also potentially cache/persist via Spark and register it that way, assuming joins are expensive
Now you can run queries as if it was a traditional database
Hey, that's great, but how fast?
500 ms average response time
200 concurrent complex queries
1+ billion rows with 200+ columns
SQL queries with 5 predicates that min, max, count some values and group by on 5 columns
No Spark caching
We take this a step further and build complex dataframes, which are made available as registered temp tables
Questions… If we run out of time, send me questions on Slack