cassandra 3 new features 2016

@doanduyhai

New Cassandra 3 Features

DuyHai DOAN

Apache Cassandra Evangelist


Who Am I ?

Duy Hai DOAN, Apache Cassandra Evangelist

•  talks, meetups, confs …

•  open-source projects (Achilles, Apache Zeppelin ...)

•  OSS Cassandra point of contact

☞ [email protected] ☞ @doanduyhai


Datastax

•  Founded in April 2010

•  We contribute a lot to Apache Cassandra™

•  400+ customers (25 of the Fortune 100), 400+ employees

•  Headquarters in San Francisco Bay area

•  EU headquarters in London, offices in France and Germany

•  Datastax Enterprise = OSS Cassandra + extra features


Agenda


•  Materialized Views

•  User Defined Functions (UDF) and Aggregates (UDA)

•  JSON Syntax

•  New SASI full text search index


Materialized Views (MV)

•  Why ?

•  Gotchas


Why Materialized Views ?

•  Relieve the pain of manual denormalization

CREATE TABLE user(id int PRIMARY KEY, country text, …);

CREATE TABLE user_by_country( country text, id int, …, PRIMARY KEY(country, id));


Materialized View In Action

The manually maintained table:

CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));

becomes a view on the base table:

CREATE MATERIALIZED VIEW user_by_country AS
    SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY(country, id);
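To make the division of labour concrete, here is a minimal usage sketch. It assumes the user table carries the firstname and lastname columns selected by the view, and the row values are purely illustrative. View updates are applied asynchronously, so a read right after the write may briefly miss the row.

```cql
-- Writes go to the base table only (illustrative values)
INSERT INTO user (id, country, firstname, lastname)
VALUES (1, 'FR', 'Duy Hai', 'DOAN');

-- Reads by country go to the view
SELECT id, firstname, lastname FROM user_by_country WHERE country = 'FR';

-- Direct writes to a view are rejected by Cassandra
-- INSERT INTO user_by_country (country, id) VALUES ('FR', 2);
```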

Materialized Views Demo

Materialized View Performance

•  Write performance
    •  slower than normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
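As an illustration of that worst case, here is a sketch assuming the user table and the user_by_country view from the earlier slides, with made-up row values. Changing a column that belongs to the view's primary key moves the view row, which is where the DELETE + INSERT pair comes from.

```cql
-- Base table row: id = 1 currently has country = 'FR' (illustrative values)
UPDATE user SET country = 'US' WHERE id = 1;

-- What has to happen for the view, conceptually:
--   1. read the current base row to learn the old view key ('FR', 1)
--   2. delete the view row at ('FR', 1)
--   3. insert the view row at ('US', 1)
-- With N views keyed on the changed column: up to N x 2 extra mutations.
```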


Materialized View Performance

•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)

•  Makes developer life easier → priceless


Materialized View Performance

•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)
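The two read paths can be compared side by side. This is only a sketch: it reuses the user table and user_by_country view from the earlier slides, and the index name user_country_idx is made up for the example.

```cql
-- Secondary index: the coordinator may have to contact many nodes,
-- and each node reads its local index, then the matching rows
CREATE INDEX user_country_idx ON user (country);
SELECT * FROM user WHERE country = 'FR';

-- Materialized view: country is the partition key, so the query
-- goes to the replicas owning that single partition
SELECT * FROM user_by_country WHERE country = 'FR';
```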


Materialized Views Consistency

•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog

•  Weaker consistency guarantees for MV than for base table.


Q & A


User Defined Functions (UDF)

•  Why ?

•  UDAs

•  Gotchas


Rationale

•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful functions (sum, avg …)
    •  accelerate analytics use-case (pre-aggregation for Spark)


How to create an UDF ?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
    // source code here
$$;

•  (param1 type1, param2 type2, …): param name to refer to in the code, type = Cassandra type

•  CALLED ON NULL INPUT: always called, null-check mandatory in code

•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned

•  RETURNS returnType: Cassandra types
    •  primitives (boolean, int, …)
    •  collections (list, set, map)
    •  tuples
    •  UDT

•  LANGUAGE language: JVM supported languages
    •  Java, Scala
    •  Javascript (slow)
    •  Groovy, Jython, JRuby
    •  Clojure (JSR 223 impl issue)
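A concrete instance of the syntax may help. The sketch below shows the two null-handling modes on the same logic. The function names is_adult and is_adult_or_false are invented for the example, and it assumes a users table with an int age column and that UDFs are enabled in cassandra.yaml (enable_user_defined_functions: true).

```cql
-- Skipped when age is null: the body never sees a null
CREATE OR REPLACE FUNCTION is_adult(age int)
RETURNS NULL ON NULL INPUT
RETURNS boolean
LANGUAGE java
AS $$ return age >= 18; $$;

-- Always called: the body must do its own null-check
CREATE OR REPLACE FUNCTION is_adult_or_false(age int)
CALLED ON NULL INPUT
RETURNS boolean
LANGUAGE java
AS $$ return age != null && age >= 18; $$;

-- Usage on the assumed users table
SELECT id, is_adult(age), is_adult_or_false(age) FROM users;
```

For a row with a null age, the first function yields null and the second yields false.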


UDF Demo


User Defined Aggregates (UDA)

•  Real use-case for UDF

•  Aggregation server-side → huge network bandwidth saving

•  Provide similar behavior for Group By, Sum, Avg etc …


How to create an UDA ?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;

•  (type1, type2, …): only type, no param name

•  STYPE: state type

•  INITCOND: initial state (a value of the state type)

•  SFUNC: accumulator function. Signature: accumulatorFunction(stateType, type1, type2, …) RETURNS stateType

•  FINALFUNC: optional final function. Signature: finalFunction(stateType)
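To show how the pieces fit together, here is a sketch of an average aggregate over int values, modelled on the example in the Cassandra CQL documentation. The names avg_state, avg_final and average are illustrative, and the state is a (count, sum) tuple.

```cql
-- Accumulator: (state, value) -> state, called once per row
CREATE OR REPLACE FUNCTION avg_state(state tuple<int, bigint>, val int)
CALLED ON NULL INPUT
RETURNS tuple<int, bigint>
LANGUAGE java
AS $$
    if (val != null) {
        state.setInt(0, state.getInt(0) + 1);                 // count
        state.setLong(1, state.getLong(1) + val.intValue());  // sum
    }
    return state;
$$;

-- Final function: state -> result, called once at the end
CREATE OR REPLACE FUNCTION avg_final(state tuple<int, bigint>)
CALLED ON NULL INPUT
RETURNS double
LANGUAGE java
AS $$
    if (state.getInt(0) == 0) return null;
    double r = state.getLong(1);
    r /= state.getInt(0);
    return Double.valueOf(r);
$$;

CREATE OR REPLACE AGGREGATE average(int)
SFUNC avg_state
STYPE tuple<int, bigint>
FINALFUNC avg_final
INITCOND (0, 0);
```

The aggregate is then called like a built-in function on any int column, ideally restricted to one partition.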


UDA Demo


Gotchas

•  UDA in Cassandra is not distributed !

•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan

•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs
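The difference between a bounded and an unbounded aggregation can be sketched as follows. The readings table is hypothetical, and the built-in avg stands in for any aggregate, on the assumption that a UDA follows the same coordinator-side execution.

```cql
-- Hypothetical table: one partition per sensor
CREATE TABLE readings (
    sensor_id int,
    ts timestamp,
    value int,
    PRIMARY KEY (sensor_id, ts)
);

-- Bounded: one partition, restricted to a time slice
SELECT avg(value) FROM readings
WHERE sensor_id = 42 AND ts >= '2016-01-01' AND ts < '2016-02-01';

-- Risky on a large table: the coordinator pulls every row to aggregate it
SELECT avg(value) FROM readings;
```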


Cassandra UDA or Apache Spark ?

Consistency Level | Single/Multiple Partition(s) | Recommended Approach
------------------+------------------------------+------------------------------------------------
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely

Q & A


JSON Syntax

•  Why ?

•  Example


Why JSON ?

•  JSON is a very good exchange format

•  But a terrible schema …

•  How to have best of both worlds ?
    •  use Cassandra schema
    •  convert rows to JSON format


JSON syntax for INSERT/UPDATE/DELETE

CREATE TABLE users (
    id text PRIMARY KEY,
    age int,
    state text
);

INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';

INSERT INTO users(id, age, state) VALUES('me', fromJson('20'), 'CA');

UPDATE users SET age = fromJson('25') WHERE id = fromJson('"me"');

DELETE FROM users WHERE id = fromJson('"me"');


JSON syntax for SELECT

> SELECT JSON * FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"id": "me", "age": 25, "state": "CA"}

> SELECT JSON age,state FROM users WHERE id = 'me';

 [json]
----------------------------------------
 {"age": 25, "state": "CA"}

> SELECT age, toJson(state) FROM users WHERE id = 'me';

 age | system.tojson(state)
-----+----------------------
  25 | "CA"

JSON Syntax Demo


Q & A


SASI index, the search is over!

•  Why ?

•  How ?

•  Who ?

•  Demo !

•  When ?


Why SASI ?

•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)

•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱


How ?

•  New index structure = suffix trees

•  Extended predicates (=, inequalities, LIKE %)

•  Full text search (tokenizers, stop-words, stemming …)

•  Query Planner to optimize AND predicates

•  NO, we don’t use Apache Lucene
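As a sketch of what the extended predicates look like in practice, the index below enables substring search on a text column. It assumes the user table has a lastname column, and the index name user_lastname_idx is made up. The index class, analyzer class and option names come from the SASI documentation.

```cql
-- CONTAINS mode allows LIKE '%...%'; the analyzer makes matching case-insensitive
CREATE CUSTOM INDEX user_lastname_idx ON user (lastname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

-- Substring match served by the SASI index
SELECT * FROM user WHERE lastname LIKE '%oan%';
```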


Who ?

•  Open source contribution by an engineering team from …

SASI Demo

When ?

•  Cassandra 3.5

•  Later
    •  support for OR clause: (aaa OR bbb) AND (ccc OR ddd)
    •  index on collections (Set, List, Map)


Comparison

SASI vs Solr/ElasticSearch ?

•  Cassandra is not a search engine !!! (database = durability)

•  always slower because 2 passes (SASI index read + original Cassandra data)

•  no scoring

•  no ordering (ORDER BY)

•  no grouping (GROUP BY) → Apache Spark for analytics

Still, SASI covers 80% of search use-cases and people are happy !

Q & A


[email protected]

https://academy.datastax.com/

Thank You
