orientdb - time series and event sequences - codemotion milan 2014

Post on 10-Jul-2015

Time flows, my friend

Luigi Dell’Aquila

Orient Technologies LTD

Twitter: @ldellaquila

Managing event sequences and time series with a Document-Graph Database

Codemotion Milan 2014

Time What…?

Time series:

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval (Wikipedia)

Event sequences:

• A set of events with a timestamp

• A set of relationships “happened before/after”

• Cause and effect relationships

Time as a dimension:

• Direct:

– E.g. the beginning and end of a relationship (I’ve been a friend of John since…)

• Calculated:

– E.g. speed (distance/time)

Time as a constraint:

• Query execution time!

The problem: Fast and Effective

Fast write: Time doesn’t wait! Writes just arrive

Fast read: a lot of data to be read in a short time

Effective manipulation: complex operations like

- Aggregation

- Prediction

- Analysis

Current approaches

0. Relational approach: table

Timestamp            Value
2014:11:21 14:35:00  1321
2014:11:21 14:35:01  2444
2014:11:21 14:35:02  2135
2014:11:21 14:35:03  1833

0. Relational approach: table, with the time split into columns

HH  MM  SS  Value
14  35  0   1321
14  35  1   2444
14  35  2   2135
14  35  3   1833

0. Relational – Advantages

• Simple

• It can be used together with your application data (operational)

0. Relational – Disadvantages

• Slow read (relies on an index)

• Slow insert (update the index…)

1. Document Database

• Collections of Documents instead of tables

• Schemaless

• Complex data structures

1. Document approach: Minute Based

{
  timestamp: "2014-11-21 12.05",
  load: [10, 15, 3, … 30]  // array of 60, one per second
}

1. Document approach: Hour Based

{
  timestamp: "2014-11-21 12.00",
  load: {
    0: [10, 15, 3, … 30],  // array of 60, one per second
    1: [0, 12, 31, … 24],
    …
    59: [10, 10, 1, … 16]
  }
}

1. Document approach – Advantages

• Fast write: one insert per time window, then only in-place updates (one insert plus 59 updates for a minute of per-second data)

• Fast fetch
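A minimal sketch of the minute-based write path, in plain Python rather than any database API: the first sample of a minute inserts a new document, and every later sample updates that document in place. The `store` dict and the `record` function are illustrative names, not something shown in the talk.

```python
from datetime import datetime

# In-memory stand-in for a document collection, keyed by minute.
store = {}

def record(ts: datetime, value: int) -> str:
    """Store one per-second sample and report which operation was needed."""
    key = ts.strftime("%Y-%m-%d %H.%M")
    doc = store.get(key)
    if doc is None:
        # First sample of this minute: a single insert creates the bucket.
        doc = {"timestamp": key, "load": [None] * 60}
        store[key] = doc
        op = "insert"
    else:
        # Any later sample of the same minute is an in-place update.
        op = "update"
    doc["load"][ts.second] = value
    return op

# Sixty made-up samples, one per second of 12:05.
ops = [record(datetime(2014, 11, 21, 12, 5, s), 10 + s) for s in range(60)]
print(ops.count("insert"), ops.count("update"))  # prints: 1 59
```

Pre-allocating the 60 slots at insert time keeps the document size stable, so the later updates never have to grow it.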

1. Document approach – Disadvantages

• Fixed time windows

• Single point per unit

• How to pre-aggregate?

• Relationships with the rest of the world?

• Relationships between events?

2. Graph Database

• Nodes/Edges instead of tables

• Index free adjacency

• Fast traversal

• Dynamic structure

2. Graph approach: linked sequence

e1 -next-> e2 -next-> e3 -next-> e4 -next-> e5

(timestamp on vertex)

2. Graph approach: linked sequence (tag based)

[Figure: events e1 … e5, each carrying a list of tags such as [Tag1, Tag2], [Tag1] or [Tag2]. Instead of a single "next" edge, events that share a tag are chained by a per-tag edge (nextTag1, nextTag2).]

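2. Graph approach: Hierarchy

[Figure: a tree of time units with one level each for Days, Hours, Minutes and Seconds (for example hours 1 … 24 and minutes 1 … 60); the events e1, e2, e3 … e60 hang off the lowest level.]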

2. Graph approach: mixed

[Figure: the same Days > Hours > Minutes > Seconds hierarchy, with the events e1, e2, e3 … e60 at the bottom also linked to each other in sequence.]

2. Graph approach – Advantages

• Flexible

• Events can be connected together in different ways

• You can connect events to other entities

• Fast traversal of dynamic time windows

• Fast aggregation (based on hierarchy)
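To illustrate what traversing a dynamic time window means on a linked sequence, here is a hedged sketch in plain Python. The `Event` class, its `next` pointer and the `window` helper are assumptions made for this example; the starting vertex is assumed to have been located already (for instance through the hierarchy).

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Event:
    timestamp: int                  # seconds, stored on the vertex
    value: int
    next: Optional["Event"] = None  # the "next" edge to the following event

def window(start: Event, end_ts: int) -> Iterator[Event]:
    """Follow "next" edges from start until the timestamp passes end_ts."""
    node = start
    while node is not None and node.timestamp <= end_ts:
        yield node
        node = node.next

# Build e1 -> e2 -> ... -> e5 with one sample value per second.
samples = [1321, 2444, 2135, 1833, 900]
events = [Event(timestamp=t, value=v) for t, v in enumerate(samples)]
for current, following in zip(events, events[1:]):
    current.next = following

# A window of arbitrary size: start at e2, stop after timestamp 3.
selected = [e.value for e in window(events[1], end_ts=3)]
print(selected)  # prints: [2444, 2135, 1833]
```

The cost depends only on the number of events inside the window, not on the total number of events stored, which is the point of index-free adjacency.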

2. Graph approach – Disadvantages

• Slow writes (vertex + edge + maintenance)

• Not so fast reads

Can we mix different models and get all the advantages?

Can we mix all this with the rest of application logic?

Multi-Model!

• Document database (schema-free, complex properties)

• Graph database (index-free adjacency, fast traversal)

• SQL (extended)

• Operational (schema, ACID)

• OO concepts (classes, inheritance, polymorphism)

• REST/JSON interface

• Native Javascript (extend query language, expose services, event hooks)

• Distributed architecture (multi-master replica/sharding)

OrientDB

First step: put them together

[Figure: the Days > Hours > Minutes hierarchy is kept as a graph; each minute vertex is a document holding one value per second.]

{
  0: 1000,
  1: 1500,
  …
  59: 96
}

Graph: Days, Hours, Minutes

Document: the per-second values <- IT’S A VERTEX TOO!!!

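[Figure: a coarser variant. Only Days and Hours are kept as a graph; each hour vertex is a document with one entry per minute, each holding one value per second.]

{
  0: {
    0: 1000,
    1: 1500,
    …
    59: 210
  },
  1: { … },
  …
  59: { … }
}

Graph: Days, Hours

Document: minutes and seconds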

Where should I stop?

It depends on my domain and requirements.
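As a rough illustration of the combined layout, the sketch below uses nested Python dicts for the graph levels (day, hour, minute) and a plain dict for the per-second document stored on the minute vertex. The function names and the choice to stop at the minute level are assumptions for this example only.

```python
# Graph part: root -> day -> hour -> minute, modelled with nested dicts.
# Document part: the minute vertex itself, holding one value per second.
root = {}

def minute_vertex(day: str, hour: int, minute: int) -> dict:
    """Walk the path root -> day -> hour -> minute, creating missing levels."""
    hours = root.setdefault(day, {})
    minutes = hours.setdefault(hour, {})
    return minutes.setdefault(minute, {})

def write(day: str, hour: int, minute: int, second: int, value: int) -> None:
    """Three hops through the graph, then one field update in the document."""
    minute_vertex(day, hour, minute)[second] = value

write("2014-11-21", 14, 35, 0, 1000)
write("2014-11-21", 14, 35, 1, 1500)
write("2014-11-21", 14, 35, 59, 96)
print(root["2014-11-21"][14][35])  # prints: {0: 1000, 1: 1500, 59: 96}
```

Stopping one level higher (hour vertices holding minute and second data) would mean fewer vertices but larger documents; which trade-off is right depends on the domain, as the slide says.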

Result:

• Same insert speed as the Document approach

• But with the flexibility of a Graph

• (as a side effect of mixing models, documents can also contain “pointers” to other elements of the application domain)

Second step: Pre-aggregate

[Figure: the same structure as in the first step (Days > Hours > Minutes as a graph, per-second values {0: 1000, 1: 1500, … 59: 96} as a document that is also a vertex). The upper-level vertices each store a pre-computed sum() of the values below them.]

How to aggregate

Hooks: server-side triggers (Java or Javascript), executed when DB operations happen (e.g. insert or update)

Java interface:

public RESULT onBeforeInsert(…);
public void onAfterInsert(…);
public RESULT onBeforeUpdate(…);
public void onAfterUpdate(…);

Aggregation logic

• Second 0 -> insert

• Second 1 -> update

• …

• Second 57 -> update

• Second 58 -> update

• Second 59 -> update + aggregate

– Write aggregate value on minute vertex

• Minute == 59? Calculate aggregate on hour vertex
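The roll-up rule above can be sketched in plain Python as follows. This is not the OrientDB hook API: `Vertex` and `on_sample` are hypothetical names standing in for a vertex class and an after-insert/after-update hook, and the sketch assumes samples arrive in order with no gaps.

```python
class Vertex:
    """A node of the time hierarchy; aggregate stays None until complete."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}     # minute number -> minute vertex (graph part)
        self.values = {}       # second -> value (document part, minutes only)
        self.aggregate = None

def on_sample(hour: Vertex, minute: int, second: int, value: int) -> None:
    """Mimic the hook: insert or update, then roll sums up when a unit closes."""
    vertex = hour.children.get(minute)
    if vertex is None:
        # First sample of the minute (second 0): insert a new minute vertex.
        vertex = Vertex(parent=hour)
        hour.children[minute] = vertex
    # Every sample, including the first, sets one field of the document.
    vertex.values[second] = value
    if second == 59:
        # The minute is complete: write its aggregate on the minute vertex.
        vertex.aggregate = sum(vertex.values.values())
        if minute == 59:
            # The hour is complete too: aggregate the minute aggregates.
            hour.aggregate = sum(m.aggregate for m in hour.children.values())

hour = Vertex()
for minute in range(60):
    for second in range(60):
        on_sample(hour, minute, second, 1)
print(hour.children[0].aggregate, hour.aggregate)  # prints: 60 3600
```

Because aggregation is triggered only when a unit closes, 59 out of 60 writes pay no aggregation cost at all.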

[Figure: the hierarchy with its aggregate state. Vertices whose time unit is complete carry a value (sum = 1000, sum = 15000, sum = 300); vertices that are still incomplete have sum = null. A sample minute document: {0: 1, 1: 12, … 59: 3}.]

Query logic:

• Traverse from the root node to the specified level (filtering based on vertex data)

• Is there an aggregate value?

– Yes: return it

– No: go one level down and do the same

Aggregation on a level will be VERY fast if you have horizontal edges!
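A small recursive sketch of this query logic, using nested Python dicts in place of vertices. The field names `aggregate`, `children` and `values` are invented for the example; the numbers echo the ones in the figure but the tree shape is made up.

```python
def total(vertex: dict) -> int:
    """Return the stored aggregate if present, otherwise sum one level down."""
    if vertex.get("aggregate") is not None:
        return vertex["aggregate"]              # complete: answer is precomputed
    if "values" in vertex:
        return sum(vertex["values"].values())   # leaf document: raw samples
    return sum(total(child) for child in vertex.get("children", []))

complete_minute = {"aggregate": 1000, "values": {}}
open_minute = {"aggregate": None, "values": {0: 1, 1: 12, 59: 3}}
open_hour = {"aggregate": None, "children": [complete_minute, open_minute]}
closed_hour = {"aggregate": 15000, "children": []}
day = {"aggregate": None, "children": [closed_hour, open_hour]}

print(total(day))  # prints: 16016
```

Only the incomplete branch (the open hour and its open minute) is actually descended into; the closed hour answers from its stored value.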

How to calculate aggregate values with a query

Input params:

- Root node (suppose it is #11:11)

select sum(aggregateVal) from (
  traverse out() from #11:11
  while in().aggregateVal is null
)

With the same logic you can query based on time windows

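Third step: Complex domains

[Figure: Hours > Minutes as a graph; each minute vertex is a document whose per-second entries are themselves documents.]

{
  0: {val: 1000},
  1: {val: 1500},
  …
  59: {
    val: 96,
    eventTags: [tag1, tag2]
    …
  }
}

Graph: Hours, Minutes

Document <- Enrich the domain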

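Another use case: Event Categories and OO

[Figure: the tag-based linked sequence again, now with a third tag. Events e1 … e5 carry tag lists such as [Tag1, Tag2, Tag3], [Tag1, Tag2], [Tag1] and [Tag2]; events sharing a tag are chained by nextTag1 and nextTag2 edges, and a nextTag3 edge leads to a further event tagged [Tag3].]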

Suppose tags are hierarchical categories (Classes for vertices and/or edges)

nextTAG
  nextTagX
    nextTag1
    nextTag2
  nextTag3
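To make the idea of hierarchical edge classes concrete, here is a sketch in plain Python that models the edge classes as a class hierarchy and filters edges with `issubclass`. The hierarchy (nextTag1 and nextTag2 under nextTagX, nextTagX and nextTag3 under nextTAG) is my reading of the slide layout, and the small graph, `out` and `traverse` are made up for illustration; this is not OrientDB code.

```python
class NextTAG: pass             # root edge class
class NextTagX(NextTAG): pass   # intermediate category
class NextTag3(NextTAG): pass
class NextTag1(NextTagX): pass
class NextTag2(NextTagX): pass

# (source, edge class, target); the layout of this tiny graph is invented.
edges = [
    ("e1", NextTag1, "e2"),
    ("e1", NextTag2, "e3"),
    ("e1", NextTag3, "e6"),
    ("e2", NextTag1, "e4"),
    ("e3", NextTag2, "e5"),
]

def out(vertex: str, edge_class: type) -> list:
    """Targets of outgoing edges whose class is edge_class or a subclass."""
    return [dst for src, cls, dst in edges
            if src == vertex and issubclass(cls, edge_class)]

def traverse(start: str, edge_class: type) -> list:
    """Depth-first traversal that follows only matching edge classes."""
    seen, stack = [], [start]
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.append(vertex)
        stack.extend(reversed(out(vertex, edge_class)))
    return seen

print(traverse("e1", NextTag1))  # prints: ['e1', 'e2', 'e4']
print(traverse("e1", NextTagX))  # prints: ['e1', 'e2', 'e4', 'e3', 'e5']
```

Asking for the parent class picks up every subclass edge without listing them, which is what makes a query on the category polymorphic.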

Subset of events

TRAVERSE out('nextTag1') FROM <e1>

e1 -nextTag1-> e2 -nextTag1-> e4 -nextTag1-> e5

(tag lists shown on these events: [Tag1, Tag2, Tag3], [Tag1], [Tag1, Tag2], [Tag1])

TRAVERSE out('nextTag2') FROM <e1>

e1 -nextTag2-> e3 -nextTag2-> e5

(tag lists shown on these events: [Tag1, Tag2, Tag3], [Tag1, Tag2], [Tag2])

Subset of events (Polymorphic!!!)

TRAVERSE out('nextTagX') FROM <e1>

[Figure: all of e1 … e5 are returned, because both nextTag1 and nextTag2 edges are followed. Tag lists shown on these events: [Tag1, Tag2, Tag3], [Tag1], [Tag1, Tag2], [Tag1], [Tag2].]

Connect all this with the rest of your application domain

You’ll see, everything will get more complex: you will discover new time-related dimensions (speed, position…) and new needs (complex forecasting)

CHASE!

Chase

• Your target is running away

• You have informers that track his moves (coordinates at a point in time) and give you additional (unstructured) information

• You have a street map

• You want to:

– Catch him ASAP

– Predict his moves

– Be sure that he is inside an area

• Map is made of points and distances

• You also have speed limits for streets

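[Figure: map points (point1 … pointN) connected by streets; each street is labelled with its length and speed limit:]

Distance: 1 Km, Max speed: 70 Km/h

Distance: 2 Km, Max speed: 120 Km/h

Distance: 8 Km, Max speed: 90 Km/h

Legend: Street, Map point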

• Distance / Speed = TIME!!!
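As a quick worked example of that formula, the sketch below turns the three street labels from the map (distance and speed limit) into a minimum travel time. The function name is illustrative, and it assumes the target drives exactly at the speed limit.

```python
# (distance in km, speed limit in km/h), taken from the street labels on the map.
streets = [(1, 70), (2, 120), (8, 90)]

def min_travel_minutes(distance_km: float, max_speed_kmh: float) -> float:
    """Shortest possible time on a street: distance / speed, in minutes."""
    return distance_km / max_speed_kmh * 60

for distance, speed in streets:
    minutes = min_travel_minutes(distance, speed)
    print(f"{distance} km at {speed} km/h: {minutes:.1f} min")
```

With these figures the three streets take roughly 0.9, 1.0 and 5.3 minutes, so every edge of the map gains a time weight that can be compared with the timestamps of the events.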

You have a time series of your target’s moves

{
  Timestamp: 29/11/2014 17:15:00
  LAT: 19,12223
  LON: 42,134
}

{
  Timestamp: 29/11/2014 17:55:00
  LAT: 19,12223
  LON: 42,134
}

{
  Timestamp: 29/11/2014 17:55:00
  LAT: 19,12223
  LON: 42,134
}

Legend: Event sequence, Event

[Figure: the event sequence laid over the street map. Each event (20/11/2014 13:20:00, 21/11/2014 14:35:00, 29/11/2014 17:55:00) is linked (“Where”) to a map point; map points are connected by streets.]

Vertices and edges are also documents

So you can store complex information inside them

{
  timestamp: 22213989487987,
  lat: xxxx,
  lon: yyy,
  informer: 15,
  additional: {
    speed: 120,
    description: "the target was in a car",
    car: {
      model: "Fiat 500",
      licensePlate: "AA 123 BB"
    }
  }
}

Now you can:

• Predict his moves (e.g. statistical methods, interpolation on lat/lon + time)

• Calculate how far he can be (based on last position, avg speed and street data)

• Reach him quickly (shortest path, Dijkstra)

• … intelligence?
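One of the simplest forms of the interpolation mentioned above is a straight-line estimate between (or beyond) two sightings. The sketch below is only that: the coordinates are made up, `estimate_position` is a hypothetical helper, and a real prediction would also take the street map into account.

```python
def estimate_position(p0, p1, t):
    """Linear estimate of (lat, lon) at time t from two timed sightings.

    Each sighting is a tuple (timestamp_seconds, lat, lon). For t between the
    two timestamps this interpolates; for t after the second it extrapolates.
    """
    t0, lat0, lon0 = p0
    t1, lat1, lon1 = p1
    ratio = (t - t0) / (t1 - t0)
    return (lat0 + ratio * (lat1 - lat0), lon0 + ratio * (lon1 - lon0))

# Two made-up sightings, ten minutes apart.
seen_first = (0, 19.00, 42.00)
seen_last = (600, 19.10, 42.20)

halfway = estimate_position(seen_first, seen_last, 300)  # between the sightings
ahead = estimate_position(seen_first, seen_last, 900)    # five minutes later
print(halfway, ahead)
```

Treating latitude and longitude as flat coordinates is a simplification that is only reasonable over short distances.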

But to have all this you need:

• An easy way for your informers to send time series events

Hint: REST interface

With OrientDB you can expose Javascript functions as REST services!

And you need:

• An extended query language

E.g.

TRAVERSE out("street") FROM (
  SELECT out("point") FROM #11:11   // my last event
) WHILE canBeReached($current, #11:11)

(where he could be)

With OrientDB you can write

function canBeReached(node, event)

in Javascript and use it in your queries
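The talk does not show the body of `canBeReached`. As a sketch of one way such a check could work, the Python below runs Dijkstra's algorithm over a tiny street graph whose edge weights are travel times (distance / speed limit) and compares the fastest arrival time with the time elapsed since the last sighting. The graph, the point names and the extra `elapsed_minutes` parameter are assumptions for this example, and streets are treated as one-way for brevity.

```python
import heapq

# Hypothetical street graph: point -> [(neighbour, distance_km, max_speed_kmh)].
streets = {
    "A": [("B", 1, 70), ("C", 2, 120)],
    "B": [("D", 8, 90)],
    "C": [("D", 8, 90)],
    "D": [],
}

def fastest_minutes(start: str) -> dict:
    """Dijkstra: minimum travel time in minutes from start to every point."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        elapsed, point = heapq.heappop(queue)
        if elapsed > best.get(point, float("inf")):
            continue  # stale queue entry, a faster route was already found
        for neighbour, km, kmh in streets[point]:
            candidate = elapsed + km / kmh * 60
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return best

def can_be_reached(node: str, last_seen: str, elapsed_minutes: float) -> bool:
    """True if node is reachable from last_seen within the elapsed time."""
    return fastest_minutes(last_seen).get(node, float("inf")) <= elapsed_minutes

print(can_be_reached("D", "A", 5), can_be_reached("D", "A", 7))  # prints: False True
```

In the query from the slide the elapsed time would presumably come from the timestamp of the last event rather than from an explicit argument.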

It’s just a game, but think about:

• Fraud detection

• Traffic routing

• Multi-dimensional analytics

• Forecasting

• …

Summary

One model is not enough

One of the most common issues my customers raise is:

“I have a zoo of technologies in my application stack, and it’s getting worse every day”

My answer is: Multi-Model DB

of course ;-)

From: “choose the right data model for your use case”

To: “Your application has multiple data models, you need all of them!”

This is NoSQL 2.0!!!

Thank you!

@ldellaquila

l.dellaquila@orientechnologies.com
