Realtime Analytics on the Twitter Firehose with Cassandra

Realtime Analytics with Cassandra or: How I Learned to Stop Worrying and Love Counting

Upload: acunu

Post on 12-May-2015


DESCRIPTION

Tutorial given by Tom Wilkie at Progressive NoSQL conference, 11/5/12

TRANSCRIPT

Page 1: Realtime Analytics on the Twitter Firehose with Cassandra

Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and Love Counting

Page 2: Realtime Analytics on the Twitter Firehose with Cassandra

What is Realtime Analytics?

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data.

http://blog.twitter.com/2011/03/numbers.html
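As a rough sanity check on those figures, the arithmetic can be sketched in Python. The 214-day window (1 May to 30 November 2011 inclusive) and the 1 GB/s scan rate are assumptions made for illustration, not numbers from the slides.

```python
from datetime import date

TWEETS = 30e9          # ~30 billion tweets (from the slide)
DATA_BYTES = 4.2e12    # ~4.2 TB (from the slide)

# Assumption: the query window runs from 1 May to 30 November 2011 inclusive.
days = (date(2011, 11, 30) - date(2011, 5, 1)).days + 1

tweets_per_day = TWEETS / days
bytes_per_tweet = DATA_BYTES / TWEETS

# Assumption: a hypothetical aggregate scan rate of 1 GB/s across the cluster.
SCAN_BYTES_PER_SEC = 1e9
scan_minutes = DATA_BYTES / SCAN_BYTES_PER_SEC / 60

print(f"{days} days, {tweets_per_day / 1e6:.0f}M tweets/day, "
      f"{bytes_per_tweet:.0f} bytes/tweet, {scan_minutes:.0f} min per full scan")
```

Under these assumptions every query re-reads the whole data set, which is the cost the counter-based approach tries to avoid.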


Page 3: Realtime Analytics on the Twitter Firehose with Cassandra

Introduction

Live & historical aggregates...

Page 4: Realtime Analytics on the Twitter Firehose with Cassandra

Realtime trends...

Page 5: Realtime Analytics on the Twitter Firehose with Cassandra

Drill downs and roll ups

Page 6: Realtime Analytics on the Twitter Firehose with Cassandra

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.

Page 7: Realtime Analytics on the Twitter Firehose with Cassandra

Preparing the data

Step 1: Get a feed of the tweets

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
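The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the painbird code: an in-memory `Counter` stands in for Cassandra counter columns, the function names are hypothetical, and the tokeniser does no stemming (the slide's "rock" suggests the real one does, so here "rocks" stays as it is).

```python
import re
from collections import Counter
from datetime import datetime

def tokenise(text):
    """Lower-case the tweet and split it into word tokens, dropping '#', '@' and punctuation."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def minute_bucket(timestamp):
    """Name the one-minute bucket a tweet falls into, e.g. 12:34:04 -> '1234'."""
    return timestamp.strftime("%H%M")

def count_tweet(counters, timestamp, text):
    """Add one to the [bucket, token] counter for every token occurrence in the tweet."""
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

# Assumption: an in-memory Counter stands in for the counter store.
counters = Counter()
count_tweet(counters, datetime(2012, 5, 11, 12, 34, 4), "Man, @acunu rocks!")
for key, value in sorted(counters.items()):
    print(list(key), f"+{value}")
```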


Page 8: Realtime Analytics on the Twitter Firehose with Cassandra

Querying

Step 1: Do a range query

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Step 2: Result table

Key                        #Mentions
[01/05/11 00:01, acunu]    3
[01/05/11 00:02, acunu]    5
...                        ...

Step 3: Plot pretty graph

[Graph: #Mentions (y-axis 0 to 90) by month, May to Nov]

Page 9: Realtime Analytics on the Twitter Firehose with Cassandra

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so it is not possible to do range queries on rows

• Could manually work out each row in the range and do lots of point gets

• This would suck: each query would be 100s of random IOs on disk

• Need to use wide rows, so the range query becomes a column slice and each query is ~1 IO (denormalisation)
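To see why row range queries are off the table, recall that RandomPartitioner places each row by an MD5 hash of its key, so keys that are adjacent in time end up at unrelated positions. The sketch below uses `hashlib.md5` as a stand-in for the partitioner's token function (Cassandra's exact token derivation differs in detail), and the row key format is hypothetical.

```python
import hashlib

def token(row_key):
    """Stand-in for a hash partitioner: map a row key to a 128-bit integer."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")

# Hypothetical per-minute row keys for one term, in time order.
keys = [f"01/05/11 00:{minute:02d}, acunu" for minute in range(1, 6)]

# Rows are placed in token order, not key order.
for key in sorted(keys, key=token):
    print(f"{token(key):039d}  {key}")

# With one row per minute, a one-day query needs one point get per minute.
point_gets_per_day = 24 * 60
print(point_gets_per_day, "point gets for a single day at minute resolution")
```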


Page 10: Realtime Analytics on the Twitter Firehose with Cassandra

So instead of this...

Key                        #Mentions
[01/05/11 00:01, acunu]    3
[01/05/11 00:02, acunu]    5
...                        ...

We do this

Key                  00:01    00:02    ...
[01/05/11, acunu]    3        5        ...
[02/05/11, acunu]    12       4        ...
...                  ...      ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
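A toy model of this layout, assuming nothing about Cassandra's internals beyond "columns within a row are kept sorted": the `WideRow` class and the `increment` helper are hypothetical names, and plain Python containers stand in for the column family.

```python
from bisect import bisect_left, bisect_right

class WideRow:
    """Toy model of one wide row: columns kept sorted by column key."""

    def __init__(self):
        self.names = []    # sorted column keys, e.g. '00:01'
        self.values = {}   # column key -> counter value

    def increment(self, column, amount=1):
        if column not in self.values:
            self.names.insert(bisect_left(self.names, column), column)
            self.values[column] = 0
        self.values[column] += amount

    def slice(self, start, finish):
        """Return the contiguous run of columns with start <= key <= finish."""
        lo = bisect_left(self.names, start)
        hi = bisect_right(self.names, finish)
        return [(name, self.values[name]) for name in self.names[lo:hi]]

# Row key is the 'big' bucket (day, term); column key is the 'small' bucket (minute).
rows = {}

def increment(day, term, minute, amount=1):
    rows.setdefault((day, term), WideRow()).increment(minute, amount)

increment("01/05/11", "acunu", "00:01", 3)
increment("01/05/11", "acunu", "00:02", 5)
increment("02/05/11", "acunu", "00:01", 12)
increment("02/05/11", "acunu", "00:02", 4)

# One row lookup plus one contiguous column slice answers the per-day query.
print(rows[("01/05/11", "acunu")].slice("00:00", "23:59"))
```

The point of the design is that everything a query needs for one day and one term sits next to each other in a single row, so the read is one lookup and one sequential slice.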


Page 11: Realtime Analytics on the Twitter Firehose with Cassandra

Demo

./painbird.py -u tom_wilkie

Page 12: Realtime Analytics on the Twitter Firehose with Cassandra

Now it's your turn...

Page 13: Realtime Analytics on the Twitter Firehose with Cassandra

1. Get a Twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/O9hkv

3. Cluster them up

4. Get the code - http://goo.gl

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)


Page 14: Realtime Analytics on the Twitter Firehose with Cassandra

Get some Cassandra VMs

http://goo.gl/O9hkv

Page 15: Realtime Analytics on the Twitter Firehose with Cassandra

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)


Page 16: Realtime Analytics on the Twitter Firehose with Cassandra

Get the code

SSH into one of the VMs:

# curl https://acunu-oss.s3.amazonaws.com/painbird.tar.gz | tar zxf -

# curl -o pycassa.rpm https://acunu-oss.s3.amazonaws.com/pycassa.rpm

# rpm -i pycassa.rpm

# cd release

# ./painbird.py -u tom_wilkie


Page 17: Realtime Analytics on the Twitter Firehose with Cassandra

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
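One possible shape for the two functions, offered as a sketch rather than the intended solution. Everything beyond the two signatures is an assumption: `cassandra` is taken to be an object with `add` and `get` methods (named after the methods on pycassa's `ColumnFamily`), the tweet is a dict with hypothetical `created_at` and `text` fields, and the row key `m-<month>-<term>` with a day-of-month column loosely follows the `m-5-...` key visible in the cassandra-cli output later in the deck. An in-memory `FakeCounters` class stands in for the cluster so the example runs on its own.

```python
import re
from datetime import datetime

class FakeCounters:
    """In-memory stand-in for a counter column family (hypothetical, for illustration)."""

    def __init__(self):
        self.rows = {}

    def add(self, key, column, value=1):
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + value

    def get(self, key, column_start, column_finish):
        row = self.rows.get(key, {})
        return {c: v for c, v in sorted(row.items())
                if column_start <= c <= column_finish}

def insert_tweet(cassandra, tweet):
    """Increment one counter per token: row = month and term, column = day of month."""
    when = tweet["created_at"]
    for token in re.findall(r"[a-z0-9_]+", tweet["text"].lower()):
        cassandra.add(f"m-{when.month}-{token}", when.day)

def do_query(cassandra, term, start, finish):
    """Return {day: mentions} for term; start and finish are dates in the same month."""
    key = f"m-{start.month}-{term.lower()}"
    return cassandra.get(key, column_start=start.day, column_finish=finish.day)

cassandra = FakeCounters()
insert_tweet(cassandra, {"created_at": datetime(2012, 5, 11, 12, 33, 49),
                         "text": "I ate a #bee; woe is..."})
insert_tweet(cassandra, {"created_at": datetime(2012, 5, 11, 12, 34, 4),
                         "text": "Man, @acunu rocks!"})
print(do_query(cassandra, "woe", datetime(2012, 5, 1), datetime(2012, 5, 31)))
```

Against a real cluster the stand-in would be replaced by a counter column family, and a query for a term that was never seen would need to handle pycassa's not-found error instead of receiving an empty dict.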


Page 18: Realtime Analytics on the Twitter Firehose with Cassandra

Check your data

-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

Page 19: Realtime Analytics on the Twitter Firehose with Cassandra

Extensions


Page 20: Realtime Analytics on the Twitter Firehose with Cassandra

UI

• Pretty graphs

• Automatic periodic updates

• Search multiple terms

Painbird

• Mentions of multiple terms

• sentiment analysis - http://www.nltk.org/
