cassandra 3 new features 2016


Upload: duyhai-doan

Post on 08-Jan-2017


TRANSCRIPT

Page 1: Cassandra 3 new features 2016

@doanduyhai

New Cassandra 3 Features
DuyHai DOAN
Apache Cassandra Evangelist

Page 2: Cassandra 3 new features 2016

Who Am I?

Duy Hai DOAN, Apache Cassandra Evangelist
•  talks, meetups, confs …
•  open-source projects (Achilles, Apache Zeppelin ...)
•  OSS Cassandra point of contact
•  [email protected] ☞ @doanduyhai

Page 3: Cassandra 3 new features 2016

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 400+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Page 4: Cassandra 3 new features 2016

Agenda
•  Materialized Views
•  User Defined Functions (UDF) and Aggregates (UDA)
•  JSON Syntax
•  New SASI full text search index

Page 5: Cassandra 3 new features 2016

Materialized Views (MV)
•  Why?
•  Gotchas

Page 6: Cassandra 3 new features 2016

Why Materialized Views?
•  Relieve the pain of manual denormalization

CREATE TABLE user(id int PRIMARY KEY, country text, …);

CREATE TABLE user_by_country(country text, id int, …, PRIMARY KEY(country, id));

Page 7: Cassandra 3 new features 2016

Materialized View In Action

CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));

CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);

Page 8: Cassandra 3 new features 2016

Materialized Views Demo

Page 9: Cassandra 3 new features 2016

Materialized View Performance
•  Write performance
    •  slower than normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
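The write cost described on this slide can be illustrated with a toy in-memory model. This is only a sketch in Python, not Cassandra's implementation: the function name write_user and the dictionaries standing in for the base table and the views are assumptions made for illustration. It shows the single read-before-write shared by all views and the worst case of two extra mutations (DELETE + INSERT) per view.

```python
# Toy model of materialized-view maintenance (illustrative only, not
# Cassandra's implementation). The base table maps id -> country; each
# view is keyed by (country, id), like user_by_country on the slides.

def write_user(base, views, user_id, country):
    """Apply one base-table write; return the number of extra view mutations."""
    old = base.get(user_id)            # read-before-write, paid once for all views
    base[user_id] = country            # the base-table mutation itself
    extra = 0
    for view in views:
        if old is not None and old != country:
            del view[(old, user_id)]   # DELETE the stale view row
            extra += 1
        if country is not None and old != country:
            view[(country, user_id)] = True   # INSERT the new view row
            extra += 1
    return extra


base, views = {}, [{}, {}]                  # one base table, mv_count = 2
first = write_user(base, views, 1, "FR")    # new row: one INSERT per view
moved = write_user(base, views, 1, "US")    # changed key: DELETE + INSERT per view
print(first, moved)
```

In this model the second write reaches the worst case from the slide, mv_count x 2 extra mutations, because the column used as the view's partition key changed.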

Page 10: Cassandra 3 new features 2016

Materialized View Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless

Page 11: Cassandra 3 new features 2016

Materialized View Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)
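The difference in read fan-out can be sketched with a small Python model. Everything here is an assumption made for illustration (NODE_COUNT, the crc32-based node_for placement, the sample users); it is not how Cassandra assigns tokens. The secondary-index query below contacts every node, which is the worst case, while the view query goes to the single node owning the country partition.

```python
import zlib

NODE_COUNT = 4  # hypothetical cluster size


def node_for(partition_key):
    """Deterministic stand-in for token-based placement (assumption)."""
    return zlib.crc32(str(partition_key).encode()) % NODE_COUNT


users = {1: "FR", 2: "US", 3: "FR", 4: "DE", 5: "FR"}

# Secondary index: each node indexes only the base rows it owns (placed by id).
local_indexes = [dict() for _ in range(NODE_COUNT)]
for user_id, country in users.items():
    local_indexes[node_for(user_id)].setdefault(country, []).append(user_id)

# Materialized view: rows are placed by country, the view's partition key.
view_nodes = [dict() for _ in range(NODE_COUNT)]
for user_id, country in users.items():
    view_nodes[node_for(country)].setdefault(country, []).append(user_id)


def query_secondary_index(country):
    """Worst case: ask every node for its local matches."""
    found, contacted = [], 0
    for index in local_indexes:
        contacted += 1
        found.extend(index.get(country, []))
    return sorted(found), contacted


def query_view(country):
    """One partition, therefore one node."""
    rows = view_nodes[node_for(country)].get(country, [])
    return sorted(rows), 1
```

Both queries return the same ids for a given country; only the number of nodes contacted differs.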

Page 12: Cassandra 3 new features 2016

Materialized Views Consistency
•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table.

Page 13: Cassandra 3 new features 2016

Q & A


Page 14: Cassandra 3 new features 2016

User Defined Functions (UDF)
•  Why?
•  UDAs
•  Gotchas

Page 15: Cassandra 3 new features 2016

Rationale
•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful function (sum, avg …)
    •  accelerate analytics use-case (pre-aggregation for Spark)

Page 16: Cassandra 3 new features 2016

How to create a UDF?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
    // source code here
$$;

Page 17: Cassandra 3 new features 2016

How to create a UDF?

(param1 type1, param2 type2, …)
•  Param name to refer to in the code
•  Type = Cassandra type

Page 18: Cassandra 3 new features 2016

How to create a UDF?

CALLED ON NULL INPUT
•  Always called
•  Null-check mandatory in code

Page 19: Cassandra 3 new features 2016

How to create a UDF?

RETURNS NULL ON NULL INPUT
•  If any input is null, function execution is skipped and null is returned

Page 20: Cassandra 3 new features 2016

How to create a UDF?

RETURNS returnType
Cassandra types
•  primitives (boolean, int, …)
•  collections (list, set, map)
•  tuples
•  UDT

Page 21: Cassandra 3 new features 2016

How to create a UDF?

LANGUAGE language
JVM supported languages
•  Java, Scala
•  Javascript (slow)
•  Groovy, Jython, JRuby
•  Clojure (JSR 223 impl issue)
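The two null-handling modes are easy to confuse, so here is a minimal Python model of their behavior. It is a sketch only: the decorator returns_null_on_null_input and both max_of_* functions are hypothetical names, and real UDF bodies are written in one of the JVM languages listed above.

```python
# Toy model of the two null-handling modes of a UDF (illustrative only).

def returns_null_on_null_input(func):
    """Skip the function body and return None as soon as any argument is None."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


def max_of_called_on_null(a, b):
    # Models CALLED ON NULL INPUT: the function is always invoked,
    # so the body has to null-check its own arguments.
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


@returns_null_on_null_input
def max_of_returns_null(a, b):
    # Models RETURNS NULL ON NULL INPUT: the body can assume non-null arguments.
    return max(a, b)
```

With one null argument, the first variant decides for itself what to return, while the second never runs its body and yields None.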

Page 22: Cassandra 3 new features 2016

UDF Demo


Page 23: Cassandra 3 new features 2016

User Defined Aggregates (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide similar behavior for Group By, Sum, Avg etc …

Page 24: Cassandra 3 new features 2016

How to create a UDA?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;

•  (type1, type2, …): only type, no param name
•  STYPE: state type
•  INITCOND: initial state value (of type stateType)

Page 25: Cassandra 3 new features 2016

How to create a UDA?

SFUNC accumulatorFunction
•  Accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …) RETURNS stateType

Page 26: Cassandra 3 new features 2016

How to create a UDA?

[FINALFUNC finalFunction]
•  Optional final function. Signature: finalFunction(stateType)
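How SFUNC, STYPE, INITCOND and FINALFUNC fit together can be sketched as a fold over the selected rows. The Python below is a model of that evaluation order, not Cassandra code; run_aggregate, avg_state and avg_final are hypothetical names, and the state type is assumed to be a (total, count) tuple.

```python
from functools import reduce


def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    """Fold sfunc over the rows starting from initcond, then apply finalfunc if any."""
    state = reduce(sfunc, rows, initcond)
    return finalfunc(state) if finalfunc is not None else state


def avg_state(state, value):
    """Accumulator: (stateType, type1) -> stateType, with stateType = (total, count)."""
    total, count = state
    if value is None:                 # ignore null values in this model
        return state
    return (total + value, count + 1)


def avg_final(state):
    """Final function: stateType -> result."""
    total, count = state
    return total / count if count else None


ages = [20, 25, 42]
average = run_aggregate(ages, avg_state, (0, 0), avg_final)
print(average)
```

The accumulator is called once per row with the current state, and the optional final function turns the last state into the returned value. Without a final function, the last state itself is the result.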

Page 27: Cassandra 3 new features 2016

UDA Demo


Page 28: Cassandra 3 new features 2016

Gotchas
•  UDA in Cassandra is not distributed!
•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs

Page 29: Cassandra 3 new features 2016

Cassandra UDA or Apache Spark?

Consistency Level | Single/Multiple Partition(s) | Recommended Approach
------------------+------------------------------+------------------------------------------------
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely

Page 30: Cassandra 3 new features 2016

Q & A


Page 31: Cassandra 3 new features 2016

JSON Syntax
•  Why?
•  Example

Page 32: Cassandra 3 new features 2016

Why JSON?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have best of both worlds?
    •  use Cassandra schema
    •  convert rows to JSON format

Page 33: Cassandra 3 new features 2016

JSON syntax for INSERT/UPDATE/DELETE

CREATE TABLE users (
    id text PRIMARY KEY,
    age int,
    state text );

INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';

INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');

UPDATE users SET age = fromJson('25') WHERE id = fromJson('"me"');

DELETE FROM users WHERE id = fromJson('"me"');
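The idea of keeping the Cassandra schema while accepting JSON as input can be modeled in a few lines of Python. This is a sketch under assumptions: USERS_SCHEMA and row_from_json are hypothetical names, Python types stand in for the CQL types of the users table, and rejecting unknown columns is a choice of this model rather than a statement about server behavior.

```python
import json

# Python stand-ins for the CQL column types of the users table (assumption).
USERS_SCHEMA = {"id": str, "age": int, "state": str}


def row_from_json(document, schema=USERS_SCHEMA):
    """Decode a JSON document into a row, checking it against the schema."""
    data = json.loads(document)
    row = {}
    for column, value in data.items():
        if column not in schema:
            raise ValueError(f"unknown column: {column}")
        if value is not None and not isinstance(value, schema[column]):
            raise TypeError(f"column {column} expects {schema[column].__name__}")
        row[column] = value
    # Columns missing from the document come back as None in this model.
    return {column: row.get(column) for column in schema}


row = row_from_json('{"id": "user123", "age": 42, "state": "TX"}')
print(row)
```

The JSON document is only the exchange format here; the schema still decides which columns exist and which types they accept.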

Page 34: Cassandra 3 new features 2016

JSON syntax for SELECT

> SELECT JSON * FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"id": "me", "age": 25, "state": "CA"}

> SELECT JSON age,state FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"age": 25, "state": "CA"}

> SELECT age, toJson(state) FROM users WHERE id = 'me';

 age | system.tojson(state)
-----+----------------------
  25 | "CA"
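The three queries above differ in what gets serialized: the whole row, a projection of it, or a single column value. A minimal Python model of that difference, using the standard json module, could look like this. The helper names select_json and to_json are hypothetical, and the row is the one shown in the output above.

```python
import json

row = {"id": "me", "age": 25, "state": "CA"}


def select_json(row, columns=None):
    """Serialize the whole row, or only the requested columns, as one JSON document."""
    selected = row if columns is None else {name: row[name] for name in columns}
    return json.dumps(selected)


def to_json(value):
    """Serialize a single column value, leaving the other columns untouched."""
    return json.dumps(value)


print(select_json(row))                     # whole row as one JSON document
print(select_json(row, ["age", "state"]))   # projection as one JSON document
print(row["age"], to_json(row["state"]))    # plain column next to a JSON-encoded one
```

Note that a text value serialized on its own keeps its JSON quotes, which matches the "CA" cell in the last result.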

Page 35: Cassandra 3 new features 2016

JSON Syntax Demo


Page 36: Cassandra 3 new features 2016

Q & A


Page 37: Cassandra 3 new features 2016

SASI index, the search is over!
•  Why?
•  How?
•  Who?
•  Demo!
•  When?

Page 38: Cassandra 3 new features 2016

Why SASI?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

Page 39: Cassandra 3 new features 2016

How?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don't use Apache Lucene
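Why indexing suffixes makes LIKE '%term%' cheap can be shown with a small Python sketch. It deliberately swaps the suffix tree for a simpler structure, a sorted list of (suffix, row id) pairs searched with bisect, and it says nothing about SASI's actual on-disk format. The names build_suffix_index and like_contains and the sample data are assumptions made for illustration.

```python
from bisect import bisect_left


def build_suffix_index(values):
    """values maps row id -> text. Returns a sorted list of (suffix, row id)."""
    entries = []
    for row_id, text in values.items():
        for start in range(len(text)):
            entries.append((text[start:], row_id))
    entries.sort()
    return entries


def like_contains(index, pattern):
    """Row ids whose text contains pattern, i.e. LIKE '%pattern%'."""
    # A substring match is a prefix match on some suffix, and all suffixes
    # sharing a prefix are adjacent in the sorted list.
    position = bisect_left(index, (pattern,))
    matches = set()
    while position < len(index) and index[position][0].startswith(pattern):
        matches.add(index[position][1])
        position += 1
    return sorted(matches)


names = {1: "cassandra", 2: "spark", 3: "sandra"}
index = build_suffix_index(names)
print(like_contains(index, "and"))
```

The lookup is a binary search followed by a short scan over adjacent entries, instead of a scan over every value in the table.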

Page 40: Cassandra 3 new features 2016

Who?
•  Open source contribution by an engineering team from …

Page 41: Cassandra 3 new features 2016

SASI Demo

Page 42: Cassandra 3 new features 2016

When?
•  Cassandra 3.5
•  Later
    •  support for OR clause: (aaa OR bbb) AND (ccc OR ddd)
    •  index on collections (Set, List, Map)

Page 43: Cassandra 3 new features 2016

Comparison

SASI vs Solr/ElasticSearch?
•  Cassandra is not a search engine!!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Still, SASI covers 80% of search use-cases and people are happy!

Page 44: Cassandra 3 new features 2016

Q & A


Page 45: Cassandra 3 new features 2016


[email protected]

https://academy.datastax.com/

Thank You
