The Last Pickle: Repeatable, Scalable, Reliable, Observable Cassandra
TRANSCRIPT
CASSANDRA SF 2015
Aaron Morton (@aaronmorton)
Co-Founder & Principal Consultant
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
About The Last Pickle.
Work with clients to deliver and improve Apache Cassandra based solutions.
Apache Cassandra Committer, DataStax MVP, Apache Usergrid Committer.
Based in New Zealand, Australia, & USA.
Design / Development / Deployment
Scalable Data Model
Use no-look writes to avoid unnecessary reads.
No Look Writes
CREATE TABLE user_visits (
    user text,
    day int,  // YYYYMMDD
    PRIMARY KEY (user, day)
);
No Look Writes // Bad
SELECT * FROM user_visits WHERE user = 'aaron' AND day = 20150924;
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
No Look Writes // Better
// Idempotent: safe to issue repeatedly without reading first.
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
Scalable Data Model
Limit Partition size by bounding it in time or space.
Limit Partition Size // Bad
CREATE TABLE user_visits (
    user text,
    visit_time timestamp,
    data blob,  // up to 100K
    PRIMARY KEY (user, visit_time)
);
Limit Partition Size // Better
CREATE TABLE user_visits (
    user text,
    day_bucket int,  // YYYYMMDD
    visit_time timestamp,
    data blob,  // up to 100K
    PRIMARY KEY ((user, day_bucket), visit_time)
);
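Bucketing by day keeps any single partition bounded in size. A minimal sketch of deriving the integer YYYYMMDD bucket from a visit timestamp; the helper name is an assumption, not from the talk, and UTC is assumed so every client computes the same bucket:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DayBucket {
    // Formats an instant as the integer YYYYMMDD bucket used in the
    // (user, day_bucket) partition key. UTC is assumed (illustrative
    // helper, not part of the talk's code).
    static int bucketFor(Instant visitTime) {
        return Integer.parseInt(
                DateTimeFormatter.ofPattern("yyyyMMdd")
                        .withZone(ZoneOffset.UTC)
                        .format(visitTime));
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(Instant.parse("2015-09-24T10:15:30Z")));
        // prints 20150924
    }
}
```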
Scalable Data Model
Avoid mixed workloads on a single Table to reduce the impact of fragmentation.
Mixed Workloads // Bad
CREATE TABLE user (
    user text,
    password text,         // when password changed
    last_visit timestamp,  // each page request
    PRIMARY KEY (user)
);
Mixed Workloads // Better
CREATE TABLE user_password (
    user text,
    password text,
    PRIMARY KEY (user)
);

CREATE TABLE user_last_visit (
    user text,
    last_visit timestamp,
    PRIMARY KEY (user)
);
Scalable Data Model
Use LeveledCompactionStrategy when data is frequently overwritten or has Tombstones.
Use LCS for Overwrites
CREATE TABLE user_visits (
    user text,
    day int,  // YYYYMMDD
    PRIMARY KEY (user, day)
) WITH COMPACTION = { 'class' : 'LeveledCompactionStrategy' };
Scalable Data Model
Create parallel data models so throughput increases with
node count.
Parallel Data Models // Bad
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    hotel_name text,
    price_data blob,
    PRIMARY KEY (checkin_day, hotel_name)
);
Parallel Data Models // Better
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    city text,
    hotel_name text,
    price_data blob,
    PRIMARY KEY ((checkin_day, city), hotel_name)
);
Scalable Data Model
Use concurrent asynchronous requests to complete tasks.
Concurrent Asynchronous Requests
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    city text,
    hotel_name text,
    price_data blob,
    PRIMARY KEY ((checkin_day, city), hotel_name)
);
Concurrent Asynchronous Requests
// request cities concurrently
SELECT * FROM hotel_price WHERE checkin_day = 20150924 AND city = 'Santa Clara';
SELECT * FROM hotel_price WHERE checkin_day = 20150924 AND city = 'San Jose';
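The per-city queries are independent, so they can all be in flight at once and the results merged. A sketch of the fan-out/fan-in pattern with CompletableFuture; fetchPrices is a stand-in for a real session.executeAsync() call, not a driver API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class FanOut {
    // Stand-in for an async Cassandra query: in a real application this
    // would wrap session.executeAsync() for one (checkin_day, city)
    // partition. fetchPrices is illustrative, not from the talk.
    static CompletableFuture<List<String>> fetchPrices(int day, String city) {
        return CompletableFuture.supplyAsync(() -> List.of(city + ":hotel"));
    }

    static List<String> pricesForCities(int day, List<String> cities) {
        // Fan out: one request per city, all in flight concurrently.
        List<CompletableFuture<List<String>>> futures = cities.stream()
                .map(city -> fetchPrices(day, city))
                .collect(Collectors.toList());
        // Fan in: wait for each future and flatten the result rows.
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(pricesForCities(20150924,
                List.of("Santa Clara", "San Jose")));
    }
}
```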
Scalable Data Model
Document when Eventual Consistency, Strong Consistency, or Linearizable Consistency is required.
Scalable Data Model
Smoke Test the data model.
Data Model Smoke Test
/* Get Pricing Data */
// Load Data
INSERT INTO city_distances (city, distance, nearby_city) VALUES ('Santa Clara', 0, 'Santa Clara');
INSERT INTO city_distances (city, distance, nearby_city) VALUES ('Santa Clara', 1, 'San Jose');

INSERT INTO hotel_price (checkin_day, city, hotel_name, price_data) VALUES (20150924, 'Santa Clara', 'Hilton Santa Clara', 0xFF);
INSERT INTO hotel_price (checkin_day, city, hotel_name, price_data) VALUES (20150924, 'San Jose', 'Hyatt San Jose', 0xFF);
Data Model Smoke Test
// Step 1: get the nearby cities for the one selected by the user.
SELECT nearby_city FROM city_distances WHERE city = 'Santa Clara' AND distance < 2;
// Step 2: parallel requests for each city returned.
SELECT city, hotel_name, price_data FROM hotel_price WHERE checkin_day = 20150924 AND city = 'Santa Clara';
SELECT city, hotel_name, price_data FROM hotel_price WHERE checkin_day = 20150924 AND city = 'San Jose';
Design / Development / Deployment
Application Development
Ensure read requests are bounded and know what the size is.
(Hint: use auto-paging in Cassandra 2.0+.)
Auto Paging
PreparedStatement prepStmt = session.prepare(CQL);
BoundStatement boundStmt = new BoundStatement(prepStmt);
boundStmt.setFetchSize(100);
Application Development
Use appropriate Consistency Level.
(see Data Model Smoke Test)
Application Development
Use Token Aware Asynchronous requests with
CL ONE where possible.
Token Aware Policy
cluster = Cluster.builder()
    .addContactPoints("10.10.10.10")
    .withLoadBalancingPolicy(new TokenAwarePolicy(
        new DCAwareRoundRobinPolicy("DC1")))
    .build();
Asynchronous Requests
ResultSetFuture f = ses.executeAsync(stmt.bind("fo"));
Row row = f.getUninterruptibly().one();
Application Development
Avoid DDoS'ing the cluster.
Monitoring and Alerting
Use what you like and what works for you.
Monitoring and Alerting
Some suggestions: OpsCenter, Riemann, Grafana, Logstash, Sensu.
How To Monitor
Cluster-wide aggregate. All nodes (if possible).
Top 3 & Bottom 3 Nodes. Individual Nodes.
How To Monitor Rates
1 Minute Rate. Derivative of Counts.
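Metrics exposed only as counters just keep increasing, so a rate is the derivative of two samples: the delta in the count divided by the polling interval. A minimal sketch (the method name is illustrative):

```java
public class CounterRate {
    // Events per second between two samples of a monotonically
    // increasing counter, e.g. polled sixty seconds apart.
    static double rate(long count0, long count1, double seconds) {
        return (count1 - count0) / seconds;
    }

    public static void main(String[] args) {
        // 1200 new events over a 60 second polling interval -> 20.0/s.
        System.out.println(rate(10_000, 11_200, 60.0));
    }
}
```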
How To Monitor Latency
75th Percentile. 95th Percentile. 99th Percentile.
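A percentile is read off the sorted sample: the 99th percentile is the value below which 99% of requests completed, which is why it surfaces tail latency that an average hides. A sketch using the nearest-rank method (a simplification of the histogram-based estimates Cassandra actually reports):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples, then take the value
    // at index ceil(p/100 * n) - 1.
    static long percentile(long[] latencies, double p) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Latencies in microseconds; one slow outlier in the tail.
        long[] micros = {100, 120, 110, 105, 4000, 115, 130, 125, 108, 112};
        System.out.println(percentile(micros, 95.0)); // prints 4000
        System.out.println(percentile(micros, 50.0)); // prints 112
    }
}
```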
Monitoring Cluster Throughput
o.a.c.m.ClientRequest.Write.Latency.1MinuteRate
o.a.c.m.ClientRequest.Read.Latency.1MinuteRate
Monitoring Local Table Throughput
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.WriteLatency.1MinuteRate
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.ReadLatency.1MinuteRate
Monitoring Request Latency
o.a.c.m.ClientRequest.Write.Latency.75percentile
o.a.c.m.ClientRequest.Write.Latency.95percentile
o.a.c.m.ClientRequest.Write.Latency.99percentile
o.a.c.m.ClientRequest.Read.Latency.75percentile…
Monitoring Request Latency Per Table
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.CoordinatorWriteLatency.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.CoordinatorReadLatency.95percentile
Monitoring Local Table Latency
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.WriteLatency.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.ReadLatency.95percentile
Monitoring Read Path
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.LiveScannedHistogram.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.TombstoneScannedHistogram.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.SSTablesPerReadHistogram.95percentile
Monitoring Inconsistency
o.a.c.m.Storage.TotalHints.count
o.a.c.m.HintedHandOffManager.Hints_created-IP_ADDRESS.count
o.a.c.m.Connection.TotalTimeouts.1MinuteRate
Monitoring Eventual Consistency
o.a.c.m.ReadRepair.RepairedBackground.1MinuteRate
o.a.c.m.ReadRepair.RepairedBlocking.1MinuteRate
Monitoring Client Errors
o.a.c.m.ClientRequest.Write.Unavailables.1MinuteRate
o.a.c.m.ClientRequest.Read.Unavailables.1MinuteRate
o.a.c.m.ClientRequest.Write.Timeouts.1MinuteRate
o.a.c.m.ClientRequest.Read.Timeouts.1MinuteRate
Monitoring Errors
o.a.c.m.Storage.Exceptions.count
Monitoring Disk Usage
o.a.c.m.Storage.Load.count
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.TotalDiskSpaceUsed.count
Monitoring Pending Compactions
o.a.c.m.Compaction.PendingTasks.value
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.PendingCompactions.value
o.a.c.m.Compaction.TotalCompactionsCompleted.1MinuteRate
Monitoring Node Performance
o.a.c.m.ThreadPools.request.MutationStage.PendingTasks.value
o.a.c.m.ThreadPools.request.ReadStage.PendingTasks.value
o.a.c.m.ThreadPools.request.ReplicateOnWriteStage.PendingTasks.value
o.a.c.m.ThreadPools.request.RequestResponseStage.PendingTasks.value
Monitoring Node Performance
o.a.c.m.DroppedMessage.MUTATION.Dropped.1MinuteRate
o.a.c.m.DroppedMessage.READ.Dropped.1MinuteRate
Design / Development / Provisioning
Smoke Tests
“preliminary testing to reveal simple failures severe enough
to reject a prospective software release.”
Disk Smoke Tests
“Disk Latency and Other Random Numbers”
Al Tobey
http://tobert.github.io/post/2014-11-13-slides-disk-latency-and-other-random-numbers.html
Cassandra Smoke Test
cassandra-stress write cl=quorum -schema replication\(factor=3\) -mode native prepared cql3

cassandra-stress read cl=quorum -mode native prepared cql3

cassandra-stress mixed cl=quorum ratio\(read=1,write=4\) -mode native prepared cql3
Run Books
Plan now.
Run Books
Why are we doing this? What are we doing? How will we do it?
Fire Drills
Practice now.
Fire Drill: Short Term Single Node Failure
Down for less than Hint Window.
Available for QUORUM. No action necessary on return.
Fire Drill: Short Term Multi Node Failure (Break the cluster)
Down for less than Hint Window.
Available for ONE (maybe). Repair on return.
Fire Drill: Availability Zone / Rack Partition
Down for less than Hint Window.
Available for QUORUM. Maybe repair on return.
Fire Drill: Medium Term Single Node Failure
Down between Hint Window and gc_grace_seconds.
Available for QUORUM. Repair on return.
Fire Drill: Long Term Single Node Failure
Down longer than gc_grace_seconds.
Available for QUORUM. Replace node.
Fire Drill: Rolling Upgrade
Repeated short term failure.
Available for QUORUM.
Fire Drill: Scale Up
Repeated short term failure.
Available for QUORUM.
Fire Drill: Scale Out
Available for ALL.
Thanks.
Aaron Morton (@aaronmorton)
Co-Founder & Principal Consultant
www.thelastpickle.com