cassandra day silicon valley - april 2014: building a flexible, real-time big data applications...

Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji
Cassandra Day Silicon Valley, 07 April 2014
Clint Kelly, Member of Technical Staff, WibiData

Upload: planet-cassandra

Post on 15-Jan-2015


Category:

Technology



DESCRIPTION

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:
• Support for evolvable schemas of complex data types
• Batch training of machine learning models with Hadoop
• Real-time scoring with trained models
• Integration with Hive and R
• A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:
• The Kiji architecture and data model
• Implementing the Kiji data model in Cassandra using the Java driver and CQL3
• Integrating Cassandra with Hadoop 2.x
• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users

TRANSCRIPT

Page 1: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Building a Flexible, Real-time Big Data Applications Platform

on Cassandra with Kiji

Cassandra Day Silicon Valley
07 April 2014

Clint Kelly
Member of Technical Staff
WibiData

Page 2: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Have this...

Want to build this...

Kiji

Page 3: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

History of the Kiji Project

• Created at WibiData
• Originally built on top of HBase
• Now works with Cassandra

One data model for two databases ➔ challenges!

Page 4: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Overview

• The Kiji Project
• Kiji data model
• Kiji on Cassandra

Page 5: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

The Kiji Project


Page 6: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Components of Kiji


Batch

Data storage

Real-time

Page 7: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji real-time components

• Score models
• Manage models
• REST interface

[Diagram: Kiji components: Hadoop, C*, HBase, Avro; KijiSchema; KijiMR; KijiREST; KijiHive; KijiScoring; KijiExpress]

Page 8: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji batch components

• Expressive DSL
• Machine learning library
• Hive

Page 9: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

KijiSchema

• Serialization
• Complex data types
• Schema management

record UserLog {
  long timestamp;
  int user_id;
  string url;
  long session_id;
  array<string> terms;
}

Page 10: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

KijiSchema

• Initially HBase-only
• Now HBase and Cassandra

Page 11: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

In production now

• Fortune 500 retailer: Personalized recommendations

• OPower: Energy usage and analytics reporting


Page 12: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji data model


Page 13: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra


table

Page 14: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

table

[Diagram: a table is made up of many rows]

Page 15: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

row


Page 16: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Row key = entity ID

[Diagram: a row consists of an entity ID and data]

Page 17: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Composite entity IDs

[Diagram: a row whose entity ID has two components, 0xfa and “bob”, followed by data]

Page 18: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Data organized into column families

[Diagram: row (0xfa, “bob”) with column families info and songs]

Page 19: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Column families contain columns
Column name = qualifier

[Diagram: row (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter]

Page 20: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Columns can have timestamped versions

[Diagram: row (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter; songs:let it be is drawn as a stack of versions, with timestamp 1396560123 shown]

Page 21: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Column values can be complex data types

[Diagram: row (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter; songs:let it be is drawn as a stack of versions, with timestamp 1396560123 shown]

record SongPlay {
  long song_id;
  int user_rating;
  long session_id;
  device_type device;
}
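A minimal sketch of how a complex value such as SongPlay could be turned into bytes and read back. It uses Python's json module purely as a stand-in for the Avro serialization that appears in the component stack. The encode and decode helpers, and the choice to model device_type as a plain string, are assumptions made for illustration and are not Kiji APIs.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SongPlay:
    song_id: int
    user_rating: int
    session_id: int
    device: str  # assumption: stands in for the slide's device_type


def encode(play):
    """Serialize a SongPlay to bytes (JSON here, Avro in the real system)."""
    return json.dumps(asdict(play), sort_keys=True).encode("utf-8")


def decode(blob):
    """Rebuild a SongPlay from the bytes produced by encode()."""
    return SongPlay(**json.loads(blob.decode("utf-8")))
```

The point of the round trip is that the storage layer only ever sees opaque bytes, while readers get a typed record back.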

Page 22: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Locality groups

Arrange data based on query pattern

[Diagram: a row with an entity ID and column families info and songs]

Page 23: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Locality groups

Arrange data based on query pattern

• info: Need only one version of each column.
• songs_today: Need ASAP for real-time scoring; expires quickly.
• songs_prev_year: Used for training ML algorithms in batch; keep forever.

Page 24: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Locality groups

Arrange data based on query pattern

• info: Need only one version of each column. MAX_VERSIONS=1, TTL=FOREVER
• songs_today: Need ASAP for real-time scoring; expires quickly. MAX_VERSIONS=INFINITE, TTL="1 DAY", CACHED
• songs_prev_year: Used for training ML algorithms in batch; keep forever. MAX_VERSIONS=INFINITE, TTL=FOREVER, COMPRESSED
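The MAX_VERSIONS and TTL settings on this slide can be read as retention rules. The sketch below is a toy model of those two rules only, under the assumption that TTL is measured from the cell timestamp and that "1 DAY" means 86400 seconds. The LocalityGroup class and the retained function are hypothetical names, and CACHED and COMPRESSED are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalityGroup:
    name: str
    max_versions: Optional[int]  # None models MAX_VERSIONS=INFINITE
    ttl_seconds: Optional[int]   # None models TTL=FOREVER


def retained(cells, group, now):
    """Return the (timestamp, value) cells of one column that are still visible."""
    # Drop cells older than the TTL, if the group has one.
    live = [c for c in cells
            if group.ttl_seconds is None or now - c[0] < group.ttl_seconds]
    # Newest first, then keep at most max_versions cells.
    live.sort(key=lambda c: c[0], reverse=True)
    if group.max_versions is not None:
        live = live[:group.max_versions]
    return live


INFO = LocalityGroup("info", max_versions=1, ttl_seconds=None)
SONGS_TODAY = LocalityGroup("songs_today", max_versions=None, ttl_seconds=86400)
SONGS_PREV_YEAR = LocalityGroup("songs_prev_year", max_versions=None, ttl_seconds=None)
```

With the same input cells, INFO keeps only the newest one, SONGS_TODAY keeps everything from the last day, and SONGS_PREV_YEAR keeps everything.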

Page 25: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

KijiSchema

• Similar to Cassandra, HBase, BigTable
• Originally based on HBase ➔ timestamped versions
• Logical and physical organization are separate
• Complex data types

Page 26: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji on Cassandra


Page 27: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Locality groups ➔ Tables

[Diagram: a row with an entity ID and locality groups info, songs_today, songs_prev_year]

Locality group ~ query

CREATE TABLE loc_grp...

Page 28: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Entity ID ➔ Primary key

CREATE TABLE loc_grp (
  userid bigint,
  user text,
  PRIMARY KEY (userid, user)
) WITH CLUSTERING ORDER BY (user ASC);

Page 29: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Family, Qualifier, Version ➔ Clustering Columns

CREATE TABLE loc_grp (
  userid bigint,
  user text,
  family text,
  qualifier text,
  version bigint,
  PRIMARY KEY (userid, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);

Page 30: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Column values ➔ Blobs

CREATE TABLE loc_grp (
  userid bigint,
  user text,
  family text,
  qualifier text,
  version bigint,
  value blob,
  PRIMARY KEY (userid, user, family, qualifier, version)
) WITH CLUSTERING ORDER BY (user ASC, family ASC, qualifier ASC, version DESC);
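To make the mapping concrete, here is a small sketch that flattens one logical Kiji row into the flat rows implied by the CREATE TABLE statement above, one per (family, qualifier, version). The function name kiji_row_to_cql_rows and the nested-dict input shape are assumptions for illustration; the sort only imitates the clustering order and does not talk to Cassandra.

```python
def kiji_row_to_cql_rows(userid, user, families):
    """Flatten {family: {qualifier: {version: value}}} into CQL-style tuples.

    Each output tuple is (userid, user, family, qualifier, version, value).
    """
    rows = [
        (userid, user, family, qualifier, version, value)
        for family, columns in families.items()
        for qualifier, versions in columns.items()
        for version, value in versions.items()
    ]
    # Imitate CLUSTERING ORDER BY (family ASC, qualifier ASC, version DESC).
    rows.sort(key=lambda r: (r[2], r[3], -r[4]))
    return rows


# Hypothetical data for the row keyed by (0xfa, "bob").
bob = {
    "songs": {"help": {200: b"h1", 300: b"h2"}},
    "info": {"email": {100: b"e"}},
}
rows = kiji_row_to_cql_rows(0xFA, "bob", bob)
```

Because version sorts descending, the newest value of each column comes first within its qualifier, which is why a query for the latest value can stop after one row.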

Page 31: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Distinct Kiji column ➔ CQL row

cqlsh:kiji_music> SELECT * FROM kiji_table_users;

 userid | user | family | qualifier      | timestamp | value
--------+------+--------+----------------+-----------+---------------
 123456 | bob  | songs  | abbey road     | 139656012 | 0x81274b31032
 123456 | bob  | songs  | help           | 139625013 | 0x7c13270f129
 123456 | bob  | songs  | help           | 139621359 | 0x2307ff10370
 123456 | bob  | songs  | help           | 139625013 | 0x45e1822a497
 123456 | bob  | songs  | helter skelter | 139621324 | 0x104bb974c34

Page 32: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Physical organization of data on disk

0xfa:bob:info:email:t0:[email protected]
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:…
0xfa:bob:songs:let it be:t2:…
0xfa:bob:songs:help:t2:…
0xfa:bob:songs:helter skelter:t1:…

Efficient queries = continuous scans!
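A rough illustration of why contiguous scans are cheap: when keys are stored in sorted order, every key that shares a prefix sits in one unbroken slice, so a query only needs to find the start and read forward. The sketch uses an in-memory sorted list and Python's bisect module; the tuple key layout (entity, family, qualifier, negated version) and the prefix_scan name are assumptions, not the on-disk format.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of sorted_keys whose tuples start with prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)  # first key >= prefix
    hi = lo
    while hi < len(sorted_keys) and sorted_keys[hi][:len(prefix)] == prefix:
        hi += 1
    return sorted_keys[lo:hi]


# Versions are negated so that newer versions sort first within a qualifier.
KEYS = sorted([
    ("bob", "info", "email", 0),
    ("bob", "info", "payment", -1),
    ("bob", "songs", "let it be", -5),
    ("bob", "songs", "let it be", -4),
    ("bob", "songs", "let it be", -2),
    ("bob", "songs", "help", -2),
    ("bob", "songs", "helter skelter", -1),
    ("carol", "songs", "help", -3),
])

bob_songs = prefix_scan(KEYS, ("bob", "songs"))
```

Here bob_songs holds the five "songs" cells for "bob" as one slice, without touching the "info" cells or any other user's data.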

Page 33: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji queries ➔ CQL queries


Kiji queries can be complicated...

Page 34: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji queries ➔ CQL queries

All data in “info” column family for “bob” ➔

SELECT qualifier, value FROM loc_grp_info
  WHERE userid=0xfa AND user='bob' AND family='info'
  LIMIT 1;

Page 35: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji queries ➔ CQL queries

Data in “info:email” and last play of “help” for “bob” ➔

SELECT value FROM lg_music
  WHERE userid=0xfa AND user='bob' AND family='info' AND qualifier='email';

SELECT value FROM lg_music
  WHERE userid=0xfa AND user='bob' AND family='songs' AND qualifier='help'
  LIMIT 1;

Page 36: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji queries ➔ CQL queries

All songs played by “bob” on April 2nd ➔

SELECT qualifier, value FROM lg_music
  WHERE userid=0xfa AND user='bob' AND family='songs'
  AND timestamp >= 1396396800 AND timestamp <= 1396483200
  ALLOW FILTERING;

😱😱

Page 37: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Kiji queries ➔ CQL queries

Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)

Page 38: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Tricky queries

• Filter in CQL where possible
• Break up into multiple CQL queries
• Filter on the client
• Designing table layout is important
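As a sketch of the "filter on the client" option for the April 2nd query: fetch the whole "songs" family with a restriction the server accepts, then apply the timestamp range in application code. The rows here are an in-memory list of dicts standing in for a CQL result set, and query_family and songs_played_between are hypothetical helper names.

```python
def query_family(rows, family):
    """Stand-in for a CQL query restricted only by family (a contiguous scan)."""
    return [r for r in rows if r["family"] == family]


def songs_played_between(rows, start, end):
    """Apply the timestamp range on the client, after the server-side scan."""
    return [
        (r["qualifier"], r["version"])
        for r in query_family(rows, "songs")
        if start <= r["version"] <= end
    ]


# Hypothetical cells for "bob"; versions are seconds since the epoch.
ROWS = [
    {"family": "songs", "qualifier": "help", "version": 1396400000},
    {"family": "songs", "qualifier": "let it be", "version": 1396560123},
    {"family": "info", "qualifier": "email", "version": 1396400000},
    {"family": "songs", "qualifier": "helter skelter", "version": 1396396800},
]

april_2nd = songs_played_between(ROWS, 1396396800, 1396483200)
```

The trade-off is that the client reads every version in the family, so this only makes sense when the family is small or short-lived. Otherwise the table layout should be redesigned so the range lands on a column the server can restrict.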

Page 39: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

MapReduce

• New InputFormat, OutputFormat
• Java driver
• Hadoop 2.x
• Multiple C* queries per RecordReader
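One plausible reading of "multiple C* queries per RecordReader" is that, because each locality group lives in its own table, a reader has to combine several result streams into one record per entity. The sketch below shows that merge step under the assumption that every stream is already sorted by the same entity key. The merge_by_entity function is hypothetical and is not taken from the Kiji code base.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_by_entity(*sorted_streams):
    """Merge streams of (entity_id, cell) pairs into (entity_id, [cells]) records.

    Each input stream must already be sorted by entity_id.
    """
    merged = heapq.merge(*sorted_streams, key=itemgetter(0))
    for entity_id, group in groupby(merged, key=itemgetter(0)):
        yield entity_id, [cell for _, cell in group]


# Hypothetical per-table results, one stream per locality group.
info_rows = [(1, "info:email"), (2, "info:email")]
song_rows = [(1, "songs:help"), (3, "songs:help")]

records = list(merge_by_entity(info_rows, song_rows))
```

Since heapq.merge is lazy, the reader can emit each entity's record as soon as all streams have moved past it, without buffering whole tables.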

Page 40: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Project status


Page 41: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Initial release in ~2 weeks


www.kiji.org/getstarted

Page 42: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Next quarter

• Cassandra in all Kiji components
• Expose Cassandra-specific features
• Kiji support in CQLSH

Page 43: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Thanks to Cassandra community

• Great help on mailing lists for users, dev, java driver
• Webinars, meetups, C* Summit all available online
• Free training from DataStax
• Very easy to get up-to-speed
• Thanks to hosts and organizers today

Page 44: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra

Try it now — Kiji Bento Box

• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks

www.kiji.org/getstarted

http://jobs.wibidata.com/

Page 45: Cassandra Day Silicon Valley - April 2014: Building a flexible, real-time Big Data Applications platform on Cassandra
