The Last Pickle: Repeatable, Scalable, Reliable, Observable Cassandra
TRANSCRIPT
CASSANDRA SF 2015
Aaron Morton (@aaronmorton)
Co-Founder & Principal Consultant
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
About The Last Pickle.
Work with clients to deliver and improve Apache Cassandra based solutions.
Apache Cassandra Committer, DataStax MVP, Apache Usergrid Committer.
Based in New Zealand, Australia, & USA.
Design / Development / Deployment
Scalable Data Model
Use no-look writes to avoid unnecessary reads.
No Look Writes
CREATE TABLE user_visits (
    user text,
    day int,  // YYYYMMDD
    PRIMARY KEY (user, day)
);
No Look Writes // Bad
SELECT * FROM user_visits WHERE user = 'aaron' AND day = 20150924;
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
No Look Writes // Better
// Idempotent: safe to issue repeatedly without reading first.
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
INSERT INTO user_visits (user, day) VALUES ('aaron', 20150924);
Scalable Data Model
Limit Partition size by bounding it in time or space.
Limit Partition Size // Bad
CREATE TABLE user_visits (
    user text,
    visit_time timestamp,
    data blob,  // up to 100K
    PRIMARY KEY (user, visit_time)
);
Limit Partition Size // Better
CREATE TABLE user_visits (
    user text,
    day_bucket int,  // YYYYMMDD
    visit_time timestamp,
    data blob,  // up to 100K
    PRIMARY KEY ((user, day_bucket), visit_time)
);
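Bucketing by day keeps any single partition bounded in size. A minimal sketch of deriving the integer YYYYMMDD bucket from a visit timestamp; the helper name is an assumption, not from the talk, and UTC is assumed so every client computes the same bucket:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class DayBucket {
    // Formats an instant as the integer YYYYMMDD bucket used in the
    // (user, day_bucket) partition key. UTC is assumed (illustrative
    // helper, not part of the talk's code).
    static int bucketFor(Instant visitTime) {
        return Integer.parseInt(
                DateTimeFormatter.ofPattern("yyyyMMdd")
                        .withZone(ZoneOffset.UTC)
                        .format(visitTime));
    }

    public static void main(String[] args) {
        System.out.println(bucketFor(Instant.parse("2015-09-24T10:15:30Z")));
        // prints 20150924
    }
}
```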
Scalable Data Model
Avoid mixed workloads on a single Table to reduce the impact of fragmentation.
Mixed Workloads // Bad
CREATE TABLE user (
    user text,
    password text,         // when password changed
    last_visit timestamp,  // each page request
    PRIMARY KEY (user)
);
Mixed Workloads // Better
CREATE TABLE user_password (
    user text,
    password text,
    PRIMARY KEY (user)
);

CREATE TABLE user_last_visit (
    user text,
    last_visit timestamp,
    PRIMARY KEY (user)
);
Scalable Data Model
Use LeveledCompactionStrategy when data is frequently overwritten or has Tombstones.
Use LCS for Overwrites
CREATE TABLE user_visits (
    user text,
    day int,  // YYYYMMDD
    PRIMARY KEY (user, day)
) WITH COMPACTION = { 'class' : 'LeveledCompactionStrategy' };
Scalable Data Model
Create parallel data models so throughput increases with
node count.
Parallel Data Models // Bad
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    hotel_name text,
    price_data blob,
    PRIMARY KEY (checkin_day, hotel_name)
);
Parallel Data Models // Better
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    city text,
    hotel_name text,
    price_data blob,
    PRIMARY KEY ((checkin_day, city), hotel_name)
);
Scalable Data Model
Use concurrent asynchronous requests to complete tasks.
Concurrent Asynchronous Requests
CREATE TABLE hotel_price (
    checkin_day int,  // YYYYMMDD
    city text,
    hotel_name text,
    price_data blob,
    PRIMARY KEY ((checkin_day, city), hotel_name)
);
Concurrent Asynchronous Requests
// request cities concurrently
SELECT * FROM hotel_price WHERE checkin_day = 20150924 AND city = 'Santa Clara';
SELECT * FROM hotel_price WHERE checkin_day = 20150924 AND city = 'San Jose';
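The per-city queries are independent, so they can all be in flight at once and the results merged. A sketch of the fan-out/fan-in pattern with CompletableFuture; fetchPrices is a stand-in for a real session.executeAsync() call, not a driver API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class FanOut {
    // Stand-in for an async Cassandra query: in a real application this
    // would wrap session.executeAsync() for one (checkin_day, city)
    // partition. fetchPrices is illustrative, not from the talk.
    static CompletableFuture<List<String>> fetchPrices(int day, String city) {
        return CompletableFuture.supplyAsync(() -> List.of(city + ":hotel"));
    }

    static List<String> pricesForCities(int day, List<String> cities) {
        // Fan out: one request per city, all in flight concurrently.
        List<CompletableFuture<List<String>>> futures = cities.stream()
                .map(city -> fetchPrices(day, city))
                .collect(Collectors.toList());
        // Fan in: wait for each future and flatten the result rows.
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(pricesForCities(20150924,
                List.of("Santa Clara", "San Jose")));
    }
}
```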
Scalable Data Model
Document when Eventual Consistency, Strong Consistency, or Linearizable Consistency is required.
Scalable Data Model
Smoke Test the data model.
Data Model Smoke Test
/* Get Pricing Data */
// Load Data
INSERT INTO city_distances (city, distance, nearby_city) VALUES ('Santa Clara', 0, 'Santa Clara');
INSERT INTO city_distances (city, distance, nearby_city) VALUES ('Santa Clara', 1, 'San Jose');

INSERT INTO hotel_price (checkin_day, city, hotel_name, price_data) VALUES (20150924, 'Santa Clara', 'Hilton Santa Clara', 0xFF);
INSERT INTO hotel_price (checkin_day, city, hotel_name, price_data) VALUES (20150924, 'San Jose', 'Hyatt San Jose', 0xFF);
Data Model Smoke Test
// Step 1: get the nearby cities for the one selected by the user.
SELECT nearby_city FROM city_distances WHERE city = 'Santa Clara' AND distance < 2;
// Step 2: parallel requests for each city returned.
SELECT city, hotel_name, price_data FROM hotel_price WHERE checkin_day = 20150924 AND city = 'Santa Clara';
SELECT city, hotel_name, price_data FROM hotel_price WHERE checkin_day = 20150924 AND city = 'San Jose';
Design / Development / Deployment
Application Development
Ensure read requests are bounded and know what the size is.
(Hint: use auto-paging in Cassandra 2.0+.)
Auto Paging
PreparedStatement prepStmt = session.prepare(CQL);
BoundStatement boundStmt = new BoundStatement(prepStmt);
boundStmt.setFetchSize(100);
Application Development
Use appropriate Consistency Level.
(see Data Model Smoke Test)
Application Development
Use Token Aware Asynchronous requests with
CL ONE where possible.
Token Aware Policy
cluster = Cluster.builder()
    .addContactPoints("10.10.10.10")
    .withLoadBalancingPolicy(new TokenAwarePolicy(
        new DCAwareRoundRobinPolicy("DC1")))
    .build();
Asynchronous Requests
ResultSetFuture f = ses.executeAsync(stmt.bind("fo"));
Row row = f.getUninterruptibly().one();
Application Development
Avoid DDoS'ing the cluster.
Monitoring and Alerting
Use what you like and what works for you.
Monitoring and Alerting
Some suggestions: OpsCenter, Riemann, Grafana, Logstash, Sensu.
How To Monitor
Cluster-wide aggregate. All nodes (if possible).
Top 3 & Bottom 3 Nodes. Individual Nodes.
How To Monitor Rates
1 Minute Rate. Derivative of Counts.
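Metrics exposed only as counters just keep increasing, so a rate is the derivative of two samples: the delta in the count divided by the polling interval. A minimal sketch (the method name is illustrative):

```java
public class CounterRate {
    // Events per second between two samples of a monotonically
    // increasing counter, e.g. polled sixty seconds apart.
    static double rate(long count0, long count1, double seconds) {
        return (count1 - count0) / seconds;
    }

    public static void main(String[] args) {
        // 1200 new events over a 60 second polling interval -> 20.0/s.
        System.out.println(rate(10_000, 11_200, 60.0));
    }
}
```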
How To Monitor Latency
75th Percentile. 95th Percentile. 99th Percentile.
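A percentile is read off the sorted sample: the 99th percentile is the value below which 99% of requests completed, which is why it surfaces tail latency that an average hides. A sketch using the nearest-rank method (a simplification of the histogram-based estimates Cassandra actually reports):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank percentile: sort the samples, then take the value
    // at index ceil(p/100 * n) - 1.
    static long percentile(long[] latencies, double p) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Latencies in microseconds; one slow outlier in the tail.
        long[] micros = {100, 120, 110, 105, 4000, 115, 130, 125, 108, 112};
        System.out.println(percentile(micros, 95.0)); // prints 4000
        System.out.println(percentile(micros, 50.0)); // prints 112
    }
}
```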
Monitoring Cluster Throughput
o.a.c.m.ClientRequest.Write.Latency.1MinuteRate
o.a.c.m.ClientRequest.Read.Latency.1MinuteRate
Monitoring Local Table Throughput
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.WriteLatency.1MinuteRate
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.ReadLatency.1MinuteRate
Monitoring Request Latency
o.a.c.m.ClientRequest.Write.Latency.75percentile
o.a.c.m.ClientRequest.Write.Latency.95percentile
o.a.c.m.ClientRequest.Write.Latency.99percentile
o.a.c.m.ClientRequest.Read.Latency.75percentile…
Monitoring Request Latency Per Table
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.CoordinatorWriteLatency.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.CoordinatorReadLatency.95percentile
Monitoring Local Table Latency
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.WriteLatency.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.ReadLatency.95percentile
Monitoring Read Path
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.LiveScannedHistogram.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.TombstoneScannedHistogram.95percentile
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.SSTablesPerReadHistogram.95percentile
Monitoring Inconsistency
o.a.c.m.Storage.TotalHints.count
o.a.c.m.HintedHandOffManager.Hints_created-IP_ADDRESS.count
o.a.c.m.Connection.TotalTimeouts.1MinuteRate
Monitoring Eventual Consistency
o.a.c.m.ReadRepair.RepairedBackground.1MinuteRate
o.a.c.m.ReadRepair.RepairedBlocking.1MinuteRate
Monitoring Client Errors
o.a.c.m.ClientRequest.Write.Unavailables.1MinuteRate
o.a.c.m.ClientRequest.Read.Unavailables.1MinuteRate
o.a.c.m.ClientRequest.Write.Timeouts.1MinuteRate
o.a.c.m.ClientRequest.Read.Timeouts.1MinuteRate
Monitoring Errors
o.a.c.m.Storage.Exceptions.count
Monitoring Disk Usage
o.a.c.m.Storage.Load.count
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.TotalDiskSpaceUsed.count
Monitoring Pending Compactions
o.a.c.m.Compaction.PendingTasks.value
o.a.c.m.ColumnFamily.KEYSPACE.TABLE.PendingCompactions.value
o.a.c.m.Compaction.TotalCompactionsCompleted.1MinuteRate
Monitoring Node Performance
o.a.c.m.ThreadPools.request.MutationStage.PendingTasks.value
o.a.c.m.ThreadPools.request.ReadStage.PendingTasks.value
o.a.c.m.ThreadPools.request.ReplicateOnWriteStage.PendingTasks.value
o.a.c.m.ThreadPools.request.RequestResponseStage.PendingTasks.value
Monitoring Node Performance
o.a.c.m.DroppedMessage.MUTATION.Dropped.1MinuteRate
o.a.c.m.DroppedMessage.READ.Dropped.1MinuteRate
Design / Development / Provisioning
Smoke Tests
“preliminary testing to reveal simple failures severe enough
to reject a prospective software release.”
Disk Smoke Tests
“Disk Latency and Other Random Numbers”
Al Tobey
http://tobert.github.io/post/2014-11-13-slides-disk-latency-and-other-random-numbers.html
Cassandra Smoke Test
cassandra-stress write cl=quorum -schema replication\(factor=3\) -mode native prepared cql3

cassandra-stress read cl=quorum -mode native prepared cql3

cassandra-stress mixed cl=quorum ratio\(read=1,write=4\) -mode native prepared cql3
Run Books
Plan now.
Run Books
Why are we doing this? What are we doing? How will we do it?
Fire Drills
Practice now.
Fire Drill: Short Term Single Node Failure
Down for less than Hint Window.
Available for QUORUM. No action necessary on return.
Fire Drill: Short Term Multi Node Failure (Break the cluster)
Down for less than Hint Window.
Available for ONE (maybe). Repair on return.
Fire Drill: Availability Zone / Rack Partition
Down for less than Hint Window.
Available for QUORUM. Maybe repair on return.
Fire Drill: Medium Term Single Node Failure
Down between Hint Window and gc_grace_seconds.
Available for QUORUM. Repair on return.
Fire Drill: Long Term Single Node Failure
Down longer than gc_grace_seconds.
Available for QUORUM. Replace node.
Fire Drill: Rolling Upgrade
Repeated short term failure.
Available for QUORUM.
Fire Drill: Scale Up
Repeated short term failure.
Available for QUORUM.
Fire Drill: Scale Out
Available for ALL.
Thanks.
Aaron Morton (@aaronmorton)
Co-Founder & Principal Consultant
www.thelastpickle.com