Modeling the IoT with TitanDB and Cassandra

• Ted Wilmes

• Data warehouse engineer at WellAware - wellaware.us

• Building a SaaS oil and gas production monitoring and analytics platform

• Collect production O&G data from the field via cellular, satellite, and other means and deliver to our customers via mobile and browser clients

1. The property graph model and TitanDB

2. Modeling IoT

3. Time series and performance

Property graph model

[Diagram: person (name: Ted) connected by a knows edge (metOn: June 1, 2012) to person (name: George)]

Querying with Gremlin

g.V().hasLabel("person").has("name", "Ted").out("knows").values("name")

> George
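To make the model concrete, here is a minimal plain-Python sketch of the same two-vertex graph and the same lookup. It does not use Titan or TinkerPop; the `Vertex` and `Edge` classes and the `names_known_by` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str
    properties: dict
    out_edges: list = field(default_factory=list)


@dataclass
class Edge:
    label: str
    target: Vertex
    properties: dict


# The graph from the slide: Ted -knows-> George, with metOn on the edge.
ted = Vertex("person", {"name": "Ted"})
george = Vertex("person", {"name": "George"})
ted.out_edges.append(Edge("knows", george, {"metOn": "June 1, 2012"}))
vertices = [ted, george]


def names_known_by(vertices, name):
    """Rough equivalent of hasLabel('person').has('name', name).out('knows').values('name')."""
    return [
        edge.target.properties["name"]
        for vertex in vertices
        if vertex.label == "person" and vertex.properties.get("name") == name
        for edge in vertex.out_edges
        if edge.label == "knows"
    ]


print(names_known_by(vertices, "Ted"))
```

Both vertices and edges carry a label and a property map, which is what distinguishes a property graph from a plain adjacency list.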

TitanDB

• Graph database that supports pluggable storage layers

• Designed from the ground up to provide OLTP performance against large graphs with a particular focus on supporting high degree vertices (vertices with many edges)

• Implements Apache TinkerPop 3 APIs

• Cassandra acts as a solid foundation, providing high availability, performance, and ease of operation

Our Internet of Things

Things, People, Organizations, Places

A hypothetical use case: IoT…

in SPACE

Spaceship

Mars Base

Space Station

Rocket

Satellite

Many dimensions

[Diagram: vertices Rocket, Starfleet, Acme Rockets, Delta Booster, Major Tom, and Joyce; edge labels operates, builds, isModel, pilots, and maintains]

Many times, a “thing” is a system of systems

http://stardust.jpl.nasa.gov/mission/delta2.html

[Diagram: the rocket broken down into nested components: 1st Stage, 2nd Stage, 3rd Stage, Interstage, Fuel Tank, Oxidizer, Guidance Electronics, CPU, Memory, JVM, Heap Usage, Thread Count]

Continuing to zoom in

[Diagram: vertices Heap Usage, Thread Count, Alarm Condition, Major Tom, and Starfleet; edge labels triggers, notifies, monitors, reports to, and employs]

IoT modeling in summary

• Things can be interconnected systems of other things

• A high-fidelity model of ‘reality’ supports a wider variety of use cases than a disconnected set of entities

• An IoT app is only partly about things; don’t forget to include everything else (social, organizational, etc.)

Time series & Performance

Memory JVM

Heap Usage

Thread Count

Time series in Titan

Heap Usage

Our basic time series requirements

• Support a large volume of low latency writes

• Low latency retrieval on primarily the most recent data

A selection of factors affecting Titan performance

• Titan deployment topology and configuration
• All your usual Cassandra tuning tips and tricks
• Titan JVM tuning
  • Selection of an appropriate garbage collector
  • GC parameters
  • As with Cassandra, it is worthwhile to adjust NewSize
• Data modeling
• Indexing
  • Global graph indices (native Titan vs. external)
  • Vertex-centric indices
• Titan's different caches: the transaction cache and the database-level cache

Deployment options

[Diagram: deployment options within region mars-north-1: Local, Embedded, Remote]

But first, time series with CQL

* Brady Gentile - https://academy.datastax.com/demos/getting-started-time-series-data-modeling

First approach

[Diagram: Heap Usage vertex linked to a chunk vertex (chunkStart: 1442880000000, chunkEnd: 1442966400000), which links to Observation vertices (tstamp: 1442880000001 and tstamp: 1442880000002)]

• Intuitive and easy to query
• You can imagine adding further levels to the hierarchy following a year->month->day format
• Individual observations can be associated with other pieces of data
• Observations can be filtered by timestamp with an edge filter, but you still have to retrieve a large number of disparate vertices
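A writer has to decide which chunk an observation belongs to. The sketch below assumes fixed one-day chunks aligned to midnight UTC, which is consistent with the sample chunkStart and chunkEnd values on the slide (they are 86,400,000 ms apart). The `chunk_bounds` function and `DAY_MS` constant are hypothetical names, not part of Titan.

```python
DAY_MS = 24 * 60 * 60 * 1000  # assumed chunk width: one day in milliseconds


def chunk_bounds(tstamp_ms, chunk_ms=DAY_MS):
    """Return (chunkStart, chunkEnd) of the fixed-width chunk containing tstamp_ms."""
    start = tstamp_ms - (tstamp_ms % chunk_ms)
    return start, start + chunk_ms


# The first observation on the slide falls into the chunk shown above it.
print(chunk_bounds(1442880000001))
```

Using a different `chunk_ms` (an hour, a month of fixed length) trades the number of chunk vertices per series against the number of observations stored under each one.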

A further refinement

Heap Usage

• How do we reduce the number of vertices (think Cassandra partitions) that we need to retrieve?

1. Move all properties to the edge
2. Make the edge “undirected”

or, a combination of the two approaches:

1. Copy the properties to the edge
2. Keep the discrete observation vertex

[Diagram: Heap Usage chunk with observation edges carrying tstamp and value properties]

Chunk vertex with its observations

Vertex ID | chunkStart | chunkEnd | obs. @ t2 | obs. @ t1 | obs. @ t0

Observations in time descending order

Sample Gremlin queries

• Observations with tstamp > 1442162072000
  • chunk.outE().has("tstamp", gt(1442162072000))
• Observations between 1442162072000 and 1442162073000
  • chunk.outE().has("tstamp", between(1442162072000, 1442162073000))
• Most recent observation before now
  • chunk.outE().has("tstamp", lte(System.currentTimeMillis())).order().by("tstamp", decr).limit(1)
• You can wrap this in your own time-series-specific API
  • new SeriesQuery(series1).interval(startTstamp, endTstamp).decr().limit(1)
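The slide only names a `SeriesQuery` wrapper without showing it, so the following is a hypothetical sketch of what such a fluent API could look like. It runs against an in-memory list of (tstamp, value) pairs rather than Titan, adds a `run()` method that the slide does not mention, and assumes the interval is start-inclusive and end-exclusive.

```python
from bisect import bisect_left


class SeriesQuery:
    """Hypothetical fluent query over one series of (tstamp, value) pairs."""

    def __init__(self, observations):
        self._obs = sorted(observations)  # ascending by tstamp
        self._start = None
        self._end = None
        self._descending = False
        self._limit = None

    def interval(self, start, end):
        # Assumption: start is inclusive, end is exclusive.
        self._start, self._end = start, end
        return self

    def decr(self):
        self._descending = True
        return self

    def limit(self, n):
        self._limit = n
        return self

    def run(self):
        keys = [tstamp for tstamp, _ in self._obs]
        lo = 0 if self._start is None else bisect_left(keys, self._start)
        hi = len(keys) if self._end is None else bisect_left(keys, self._end)
        rows = self._obs[lo:hi]
        if self._descending:
            rows = rows[::-1]
        return rows if self._limit is None else rows[: self._limit]


series1 = [(1442162072000, 1.0), (1442162072500, 2.0), (1442162073000, 3.0)]
latest = SeriesQuery(series1).interval(1442162072000, 1442162073000).decr().limit(1).run()
print(latest)
```

A real implementation would translate the same builder calls into the Gremlin edge filters listed above, and would also need to pick which chunk vertices the interval touches.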

Pros and cons vs. separate CQL or other tsdb

• Pros
  • Allows for a single unified view of your IoT data, maintaining direct connectivity between sensor data and the other entities
  • Gremlin works well for processing streams of time series data
• Cons
  • Storage format is not as compact
  • Extra overhead of managing ‘chunks’ versus the CQL primary key taking care of that for us (e.g. chunk cache)

Heap Usage

label: hasChunk, chunkStart: 1442880000000, chunkEnd: 1442966400000

label: hasChunk, chunkStart: 1442966400000, chunkEnd: 1442966400000

A simple query - retrieve all the heap usage chunks

gremlin> g.V(4).out('hasChunk').values('chunkStart')
==>1442880000000
==>1442966400000

Getting a vertex by id

gremlin> g.V(4)
==>v[4]

Does this vertex exist?

Vertex is now loaded in Titan transaction cache

Aside - a tool of the trade

Profiler with socket tracing

Retrieving properties

gremlin> g.V(4).valueMap()
==>[sensorType:[heap usage], units:[bytes]]

Two properties

Retrieve properties

Vertex properties are now loaded in the Titan transaction cache

2 Round trips

• Not a big deal for single vertex lookup with property retrieval but can add up

• Exacerbated by magnitude of latency between Titan and Cassandra

Querying for adjacent vertices

gremlin> g.V(4).out('hasChunk').values('chunkStart')
==>1442880000000
==>1442966400000

Does this vertex exist?

Get 1st chunk properties

Get edges

Get 2nd chunk properties

Batch requests

Get 1st chunk properties Get 2nd chunk properties

Get edges

• query.batch = true
• “Whether traversal queries should be batched when executed against the storage backend. This can lead to significant performance improvement if there is a non-trivial latency to the backend.” - http://s3.thinkaurelius.com/docs/titan/0.9.0-M2/titan-config-ref.html
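The effect of batching on round trips can be illustrated with a toy model. The sketch below is not Titan code: `CountingBackend`, `chunk_starts`, the row keys, and the chunk vertex ids 10 and 11 are all hypothetical, and the initial 'exists' check is left out. It only counts how many backend calls the adjacent-vertex query needs with and without batching.

```python
class CountingBackend:
    """Toy storage backend that counts round trips (a stand-in for Cassandra)."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.rows[key]

    def multi_get(self, keys):
        # One round trip regardless of how many keys are requested.
        self.round_trips += 1
        return [self.rows[key] for key in keys]


ROWS = {
    ("edges", 4): [10, 11],  # hasChunk edges of vertex 4 (chunk ids are made up)
    ("props", 10): {"chunkStart": 1442880000000},
    ("props", 11): {"chunkStart": 1442966400000},
}


def chunk_starts(backend, sensor_id, batch):
    chunk_ids = backend.get(("edges", sensor_id))  # "Get edges"
    keys = [("props", chunk_id) for chunk_id in chunk_ids]
    if batch:
        props = backend.multi_get(keys)  # all chunk properties at once
    else:
        props = [backend.get(key) for key in keys]  # one call per chunk
    return [p["chunkStart"] for p in props]


unbatched = CountingBackend(ROWS)
batched = CountingBackend(ROWS)
chunk_starts(unbatched, 4, batch=False)
chunk_starts(batched, 4, batch=True)
print(unbatched.round_trips, batched.round_trips)
```

In this model the unbatched cost grows with the number of adjacent vertices while the batched cost stays constant, which is why the gain is larger when latency to the backend is non-trivial.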

Remove initial exists query

• storage.batch-loading = true
• WARNING - this disables vertex ‘exists’ checks

Get 1st chunk properties Get 2nd chunk properties

Get edges

Optimizing your writes

gremlin> chunk.addEdge("hasObservation", chunk, "tstamp", 1442162072000, "value", 500.123)

Write new edge

• Remove the read from your write path - storage.batch-loading = true

• batch your commits, measure latency and throughput on your system to find a good commit size
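Batching commits can be sketched as a small loop that commits every N writes. This is a generic illustration, not Titan's API: `write_in_batches` is a hypothetical helper, and `add_edge` and `commit` are caller-supplied callables standing in for the edge write shown above and the transaction commit.

```python
def write_in_batches(points, commit_size, add_edge, commit):
    """Write (tstamp, value) points, committing once per commit_size writes.

    Returns the number of commits issued.
    """
    pending = 0
    commits = 0
    for tstamp, value in points:
        add_edge(tstamp, value)
        pending += 1
        if pending == commit_size:
            commit()
            commits += 1
            pending = 0
    if pending:  # flush the final partial batch
        commit()
        commits += 1
    return commits


# Stand-ins that just record what happened.
written = []
commit_log = []
points = [(1442162072000 + i, 500.0 + i) for i in range(10)]
n = write_in_batches(points, 4, lambda t, v: written.append((t, v)), lambda: commit_log.append(len(written)))
print(n, commit_log)
```

Timing this loop for several values of `commit_size` against your own cluster is one way to find the commit size the bullet above asks you to measure.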

Quick and dirty write performance numbers

[Chart: write performance with storage.batch-loading=false vs. storage.batch-loading=true; axis ticks at 22,500, 45,000, 67,500, and 90,000]

• 9 m3.2xlarge nodes w/ C* 2.2, RF = 3, writing @ quorum, default C* settings
• 1 m3.2xlarge “client” w/ Titan 1.0-SNAPSHOT, 10 write threads writing 100 million points in total across 100,000 series

In summary

• Understanding the underlying data storage format can help with performance tuning
• Writes
  • Remove reads from the write path where possible
  • Test different batch commit sizes
  • When writing vertices you may need to adjust ids.block-size and ids.renew-percentage
• Reads
  • Batch communication between Titan and Cassandra with query.batch=true
  • Make use of global and vertex-centric indices when possible

What questions do you have and thanks!

Thanks to the Apache TinkerPop and TitanDB teams, my awesome coworkers, and the folks at DataStax for putting on an excellent summit!

Ted Wilmes Data Warehouse Engineer

@trwilmes tedwilmes@wellaware.us

Thank you
