Acunu Analytics: Simpler Real-Time Cassandra Apps

Post on 20-Jan-2015

DESCRIPTION

Talk for the Cassandra Seattle Meetup April 2013: http://www.meetup.com/cassandra-seattle/events/114988872/ Cassandra's got some properties which make it an ideal fit for building real-time analytics applications -- but getting from atomic increments to live dashboards and streaming queries is quite a stretch. In this talk, Tim Moreton, CTO at Acunu, talks about how and why they built Acunu Analytics, which adds rich SQL-like queries and a RESTful API on top of Cassandra, and looks at how it keeps Cassandra's spirit of denormalization under the hood.

TRANSCRIPT

Acunu Analytics: Simpler Real-Time Cassandra Apps

Tim Moreton, CTO, @timmoreton

Monday, 29 April 13

WE C*

• Scalable. No single point of {failure, bottleneck}
• Fast. Especially for writes
• Available. Effortless multi-DC support
• Maturing fast. Lots of production deployments

WE C*

• Virtual nodes
• CQL support

WE C*

Challenges remain, of course.

• Spartan queries
• Thrift (and CQL, a bit)
• Denormalization hurts agility
• Weak update semantics

C*: Two uses

Session storage

• Many more reads than writes
• Updates to existing records (ideally, transactionally)
• Probably fits in RAM: distribute for availability

Real-time analytics

[Graphic: a stream of log events, e.g. 02:44:02 241.24.41 0.0.1 GET /index.html]

• Many more writes than reads
• Almost all reads are to results
• Almost no writes are ‘updates’
• Distribute for availability, performance, capacity

C* on

Supplement Cassandra with:

• Rich, SQL-like queries
• RESTful HTTP APIs, JSON-based
• Automated denormalization
• Update semantics (less critical for analytics)

Analytics: Two patterns

Exploratory Analytics (?)

• Unstructured Warehouses
• Data Mining
• Machine Learning

Complex analysis, data variety
Query richness

Operational Intelligence (!)

• Dashboards
• Real-time Decisions
• Alerting

Data freshness, response time
Query speed

[Architecture diagram: event stream; ingest; processing; event store; roll-up cubes; API serving dashboards, queries and a programmatic interface]

Who uses Acunu?

• Location Data
• Web and Visitor
• Market/Tick Data
• Infrastructure
• Sensor Data
• Social Media
• Social Gaming
• Smart Grid
• Production Line

[Architecture diagram, with annotations:]

• Cassandra stores raw events and intermediate aggregates.
• Acunu Analytics is a Cassandra client that maps new events, queries and schema changes to aggregate reads and writes.
• Acunu Dashboards provides embeddable, custom data visualization using the HTTP API.

(Loosely) Define a schema

CREATE TABLE APICalls (
    time TIME('PST', HOUR, MIN, SEC),
    path PATH(/),
    useragent STRING,
    latitude DOUBLE(0.1, 0.01),
    longitude DOUBLE(0.1, 0.01)
);

CREATE CUBE SELECT COUNT, AVG(respTime) FROM APICalls WHERE time, path GROUP BY time, path;

CREATE CUBE SELECT COUNT FROM APICalls WHERE latitude, longitude GROUP BY latitude, longitude;

• Tables have an HTTP endpoint and map to a set of ColumnFamilies
• Dimensions map to keys in events and allow hierarchical aggregation
• Cubes define the dimensions and aggregates to maintain
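To make "dimensions allow hierarchical aggregation" concrete, here is a minimal Python sketch of how one event's time and coordinate values could be expanded into one bucket per declared level of granularity. The function names and the bucket label format are assumptions made for illustration, the 'PST' time zone handling is left out, and none of this is taken from Acunu's implementation.

```python
from datetime import datetime
from decimal import Decimal, ROUND_FLOOR


def time_buckets(ts: datetime) -> list[str]:
    """Return one bucket label per level of a TIME(HOUR, MIN, SEC) dimension."""
    return [
        ts.strftime("%Y-%m-%d %H"),        # hour level
        ts.strftime("%Y-%m-%d %H:%M"),     # minute level
        ts.strftime("%Y-%m-%d %H:%M:%S"),  # second level
    ]


def grid_buckets(value: str, steps=("0.1", "0.01")) -> list[Decimal]:
    """Floor a coordinate to each grid step of a DOUBLE(0.1, 0.01) dimension.

    Only valid for power-of-ten steps, because quantize() rounds to the
    number of decimal places of the step.
    """
    v = Decimal(value)
    return [v.quantize(Decimal(step), rounding=ROUND_FLOOR) for step in steps]


hour, minute, second = time_buckets(datetime(2013, 4, 29, 22, 2, 5))
coarse, fine = grid_buckets("47.61")
```

Each returned bucket would correspond to one aggregate that the event contributes to, which is why a single event can fan out into several writes.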

Aggregation

CREATE CUBE SELECT SUM(a) FROM t WHERE x, y GROUP BY g, h, i;

[Diagram: a cube cell holding value v, addressed by x, y and (g, h, i); a new event with A: v’, X: x, Y: y, Z: z]

New event: apply SUM(v, v’) on this cell

• Hierarchical dimensions cause multiple writes per event (that’s OK: Cassandra’s good at writes)
• Most aggregates result in atomic counter increments
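The aggregation step can be sketched in a few lines of Python. This is a toy in-memory model, not Acunu's code: the class name `SumCube`, the `levels` helper and the nested-dictionary layout are all assumptions. It keeps one row per WHERE tuple and one column per GROUP BY tuple, and folds each new event's value into every cell the event belongs to.

```python
from collections import defaultdict
from itertools import product


class SumCube:
    """Toy roll-up cube: one row per WHERE tuple, one column per GROUP BY tuple."""

    def __init__(self, where_dims, group_dims, measure):
        self.where_dims = where_dims
        self.group_dims = group_dims
        self.measure = measure
        # rows[where_tuple][group_tuple] -> running SUM of the measure
        self.rows = defaultdict(lambda: defaultdict(float))

    def add(self, event, levels=None):
        """Fold one event into every cell it contributes to.

        levels (assumed helper) maps a dimension name to a function that
        returns all hierarchy buckets for a value; other dimensions are flat.
        """
        levels = levels or {}

        def buckets(dim):
            expand = levels.get(dim)
            return expand(event[dim]) if expand else [event[dim]]

        col_key = tuple(event[d] for d in self.group_dims)
        for row_key in product(*(buckets(d) for d in self.where_dims)):
            # Stand-in for a counter increment on one Cassandra cell.
            self.rows[row_key][col_key] += event[self.measure]


cube = SumCube(where_dims=["x", "y"], group_dims=["g", "h", "i"], measure="a")
cube.add({"a": 2, "x": 1, "y": 2, "g": "p", "h": "q", "i": "r"})
cube.add({"a": 3, "x": 1, "y": 2, "g": "p", "h": "q", "i": "r"})
```

Passing a `levels` function for a hierarchical dimension makes one event touch several rows, which mirrors the "multiple writes per event" point on the slide.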

Queries

SELECT SUM(a) FROM t WHERE x = .. AND y = .. GROUP BY g, h, i;

New query:

• Locate slice that matches WHERE
• Return all mappings from GROUP BY tuples to cell values

[Diagram: the slice of the cube at the given x and y, with cells v indexed by (g, h, i)]

• WHEREs map to a Cassandra row and GROUP BY to a compound column key in that row (very roughly)
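Read this way, answering a query is a lookup rather than a scan. The sketch below models the pre-aggregated cells as a plain Python dictionary; the function name `query` and the sample values are hypothetical and only illustrate the "locate the row, return its columns" idea.

```python
def query(cube_rows, where_key):
    """Answer SELECT ... WHERE <row key> GROUP BY <column key> from stored cells.

    cube_rows maps a WHERE tuple to {GROUP BY tuple: aggregate}.
    Returns a copy of the matching slice, or an empty dict if no row matches.
    """
    return dict(cube_rows.get(where_key, {}))


# Hypothetical pre-aggregated cells for: WHERE x, y GROUP BY g, h, i
cells = {
    (1, 2): {("a", "b", "c"): 10.0, ("a", "b", "d"): 4.5},
    (1, 3): {("a", "b", "c"): 7.0},
}

result = query(cells, (1, 2))  # slice for WHERE x = 1 AND y = 2
```

No aggregation happens at read time in this model: the cost of the query is the size of the returned slice, which is the trade made by maintaining the cube on write.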

A concrete example

Stored aggregates before the event:

21:00       all→1345   :00→45     :01→62     :02→87    ...
22:00       all→3221   :00→22     :01→19     :02→104   ...
...         ...
UK          all→228    user01→1   user14→12  user99→7  ...
US          all→354    user01→4   user04→8   user56→17 ...
...
UK, 22:00   all→1904   ...
∅           all→87314  UK→238     US→354     ...

Each event updates multiple aggregates:

{cust_id: user01, session_id: 102, geography: UK, browser: IE, time: 22:02}

Stored aggregates after the event:

21:00       all→1345   :00→45     :01→62     :02→87    ...
22:00       all→3222   :00→22     :01→19     :02→105   ...
...         ...
UK          all→229    user01→2   user14→12  user99→7  ...
US          all→354    user01→4   user04→8   user56→17 ...
...
UK, 22:00   all→1905   ...
∅           all→87315  UK→239     US→354     ...

Queries served from these rows:

WHERE time IN (22:00, 23:00) GROUP BY minute

WHERE geography=US GROUP BY user
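The counter updates in this example can be replayed with a short Python sketch. The starting values are the pre-event numbers from the slide (cells the event does not touch are omitted); the row and column labels, the `apply_event` helper and the in-memory dictionaries are assumptions standing in for Cassandra rows and counter columns.

```python
from collections import defaultdict

# Counters before the event, copied from the slide (untouched cells omitted).
rows = defaultdict(lambda: defaultdict(int))
rows["22:00"].update({"all": 3221, ":02": 104})
rows["UK"].update({"all": 228, "user01": 1})
rows["UK, 22:00"].update({"all": 1904})
rows["∅"].update({"all": 87314, "UK": 238})

event = {"cust_id": "user01", "session_id": 102,
         "geography": "UK", "browser": "IE", "time": "22:02"}


def apply_event(rows, event):
    """Increment every counter the event contributes to; return the write count."""
    hour, minute = event["time"].split(":")
    hour_row, minute_col = f"{hour}:00", f":{minute}"
    geo = event["geography"]
    updates = [
        (hour_row, "all"), (hour_row, minute_col),  # per-hour row
        (geo, "all"), (geo, event["cust_id"]),      # per-geography row
        (f"{geo}, {hour_row}", "all"),              # geography and hour row
        ("∅", "all"), ("∅", geo),                   # grand-total row
    ]
    for row, col in updates:
        rows[row][col] += 1
    return len(updates)


writes = apply_event(rows, event)
```

In this model the single event produces seven increments, and the resulting values match the post-event figures on the slide for the 22:00, UK, "UK, 22:00" and ∅ rows.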

Rich queries

Arithmetic expressions
    SELECT SUM(x)/(MAX(y) - MIN(y) + 0.5) AS 'spread' FROM ...

Fast inner joins
    SELECT a - b AS lbound, a + b AS ubound FROM (SELECT AVG(score) AS a FROM scores WHERE year = 2012) JOIN (SELECT STDDEV(score) AS b FROM scores) USING (school)

Time zone support
    SELECT COUNT UNIQUE (visitors) GROUP BY time(DAY('US/Pacific'))

Hierarchical aggregation
    SELECT SUM(size) FROM .. WHERE path MATCHES /usr/*

Drill down to raw events
    SELECT DRILL FROM errors WHERE category IN ("warn", "error")

Limits
    SELECT COUNT (items) FROM .. GROUP BY category LIMIT 3, country

Query-time filtering
    ... HAVING AVG(rating) < 2.0 AND COUNT >= 10
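One plausible way to support the hierarchical aggregation query above (SUM(size) with a path pattern such as /usr/*) is to maintain a running total for every ancestor of each path at write time. The Python sketch below illustrates that idea only; the helper names, the sample paths and sizes, and the reading of the pattern as "everything under /usr" are assumptions, not Acunu's documented behaviour.

```python
from collections import defaultdict


def path_prefixes(path, sep="/"):
    """Return every ancestor of a path, from the root down to the path itself."""
    parts = [p for p in path.split(sep) if p]
    return [sep] + [sep + sep.join(parts[: i + 1]) for i in range(len(parts))]


# sizes[prefix] -> running SUM(size) over everything under that prefix
sizes = defaultdict(int)


def record(path, size):
    """Add one file's size to the total of each ancestor prefix."""
    for prefix in path_prefixes(path):
        sizes[prefix] += size


record("/usr/lib/libc.so", 100)
record("/usr/bin/python", 40)
record("/etc/hosts", 1)

usr_total = sizes["/usr"]  # total size of everything recorded under /usr
```

With the totals kept per prefix, a query over a subtree becomes a single lookup, at the cost of one extra write per path component on ingest.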

Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and elephant logos are trademarks of the Apache Software Foundation.

Thank You.

Tim Moreton, CTO, @timmoreton