Call me maybe: Jepsen and flaky networks
TRANSCRIPT
Typical first year for a new cluster
— Jeff Dean, Google
• ~5 racks out of 30 go wonky (50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~3 router failures (have to immediately pull traffic for an hour)
LADIS 2009
Reliable networks are a myth
• GC pause
• Process crash
• Scheduling delays
• Network maintenance
• Faulty equipment
Messages can be lost, delayed, reordered and duplicated
[Diagram: message timelines from n1 to n2 illustrating the four fault types — Drop, Delay, Duplicate, and Reorder]
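As a toy illustration of these fault types (not Jepsen code; every name and parameter here is invented), a lossy link can be simulated like this:

```python
import random

def unreliable_send(messages, drop_p=0.1, dup_p=0.1, reorder=True, rng=None):
    """Simulate a flaky link: each message may be dropped or duplicated,
    and the delivery order may be shuffled (modelling variable delay)."""
    rng = rng or random.Random(42)
    delivered = []
    for msg in messages:
        if rng.random() < drop_p:   # drop
            continue
        delivered.append(msg)
        if rng.random() < dup_p:    # duplicate
            delivered.append(msg)
    if reorder:                     # reorder / delay
        rng.shuffle(delivered)
    return delivered

received = unreliable_send(list(range(10)))
```

With all fault probabilities at zero the channel behaves like a reliable FIFO link; any system that only works in that mode is the one CAP is warning you about.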
CAP recap
• Consistency (Linearizability): A total order on all operations such that each operation looks as if it were completed at a single instant.
• Availability: Every request received by a non-failing node in the system must result in a response.
• Partition Tolerance: Arbitrarily many messages between two nodes may be lost. Mandatory unless you can guarantee that partitions don’t happen at all.
Have you planned for these?
[Diagram: Availability and Consistency, each crossed out]
• Errors
• Connection timeouts
• Hung requests (read timeouts)
• Stale results
• Dirty results
• Data lost forever!
During and after a partition
Jepsen: Testing systems under stress
• Network partitions
• Random process crashes
• Slow networks
• Clock skew
http://github.com/aphyr/jepsen
Anatomy of a Jepsen test
• Automated DB setup
• Test definitions, a.k.a. the client
• Partition types, a.k.a. the nemesis
• Scheduler of operations (client & nemesis)
• History of operations
• Consistency checker
The client is data-store specific (Mongo/Solr/Elastic); the rest is provided by Jepsen.
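The anatomy above can be sketched in Python (purely illustrative; Jepsen itself is written in Clojure, and these function names are invented):

```python
import random

def run_test(client_op, nemesis_op, steps=10, nemesis_every=4, seed=0):
    """Toy Jepsen-style scheduler: interleave client operations with
    nemesis operations and record everything in a single history."""
    rng = random.Random(seed)
    history = []
    for t in range(steps):
        if t % nemesis_every == 0:          # periodically unleash the nemesis
            history.append((t, "nemesis", nemesis_op(rng)))
        history.append((t, "client", client_op(rng)))
    return history

history = run_test(
    client_op=lambda rng: ("write", rng.randint(0, 9)),
    nemesis_op=lambda rng: ("partition", "random-halves"),
    steps=8,
)
```

The consistency checker then runs over the recorded history after the fact, which is what lets Jepsen find bugs that only manifest under a particular interleaving of client and nemesis operations.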
Nemesis
• partition-halves: split the cluster into two fixed halves, e.g. {n1, n2} | {n3, n4, n5}
• partition-random-halves: split into two randomly chosen halves, e.g. {n1, n4, n5} | {n2, n3}
• bridge: two halves that cannot see each other, e.g. {n1, n2} and {n4, n5}, with a single bridge node (n3) connected to both
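The partition patterns can be sketched as follows (illustrative Python, not Jepsen's actual nemesis implementation):

```python
import random

def partition_random_halves(nodes, seed=None):
    """Shuffle the nodes and split them into two halves; links between
    the halves are cut, links within a half stay up."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return [set(shuffled[:mid]), set(shuffled[mid:])]

def bridge(nodes):
    """Split into two halves that cannot see each other, plus one
    'bridge' node that stays connected to both sides."""
    nodes = list(nodes)
    mid = len(nodes) // 2
    return {"half1": set(nodes[:mid]),
            "half2": set(nodes[mid + 1:]),
            "bridge": nodes[mid]}

groups = bridge(["n1", "n2", "n3", "n4", "n5"])
```

The bridge topology is the nastiest of the three: no majority exists on either side alone, but the bridge node can make both sides believe they are talking to a quorum.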
A set of integers: cas-set-client
• S = {1, 2, 3, 4, 5, …}
• Stored as a single document containing all the integers
• Update using compare-and-set
• Multiple clients try to update concurrently
• Create and restore partitions
• Finally, read the set of integers and verify consistency
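A minimal sketch of the cas-set-client idea, with an in-memory versioned store standing in for the database (all names here are hypothetical):

```python
def cas(store, expected_version, new_doc):
    """Toy compare-and-set: succeeds only if the stored version still
    matches what the client last read."""
    if store["version"] != expected_version:
        return False                      # lost the race: caller retries
    store["doc"] = new_doc
    store["version"] += 1
    return True

def add_with_retry(store, value, max_retries=10):
    """cas-set-client step: read the whole set, add one integer, CAS it back."""
    for _ in range(max_retries):
        version, doc = store["version"], set(store["doc"])
        doc.add(value)
        if cas(store, version, doc):
            return True
    return False

store = {"version": 0, "doc": set()}
for i in [1, 2, 3]:
    add_with_retry(store, i)
```

If the data store's CAS is really linearizable, every acknowledged add must survive into the final read, no matter how the clients' retries interleave; that is exactly the property the checker verifies.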
Compare and Set client
[Timeline, Clients 1 and 2 issuing concurrent operations: cas({}, 1) → {1}; cas(1, 2) → {1, 2}; cas(1, 3) fails; cas(2, 4) fails; cas(2, 5) → {1, 2, 5}]
History = [(t, op, result)]
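A history in this shape can be checked mechanically. Here is a toy set checker (a sketch, not Jepsen's actual checker; the tuple layout follows the slide, everything else is invented):

```python
def check_set_history(history, final_read):
    """Toy set checker: every acknowledged add must appear in the final
    read, otherwise it was lost. history = [(t, op, result)]."""
    acked = {op[1] for (t, op, result) in history
             if op[0] == "add" and result == "ok"}
    lost = acked - final_read
    return {"valid": not lost, "lost": sorted(lost)}

history = [
    (0, ("add", 1), "ok"),
    (1, ("add", 2), "ok"),
    (2, ("add", 3), "fail"),   # never acknowledged; may legitimately be absent
]
result = check_set_history(history, final_read={1})
```

Here `result["valid"]` is false because the add of 2 was acknowledged but is missing from the final read — precisely the "acknowledged and lost" failures reported for Elasticsearch below.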
Solr
• Search server built on Lucene
• Lucene index + transaction log
• Optimistic concurrency, linearizable CAS ops
• Synchronous replication to all ‘live’ nodes
• ZooKeeper for ‘consensus’
• http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
Solr - Are we safe?
• Leaders become unavailable for up to the ZK session timeout, typically 30 seconds (expected)
• Some writes ‘hang’ for a long time during a partition. Timeouts are essential. (unexpected)
• Final reads under CAS are consistent but we haven’t proved linearizability (good!)
• Loss of availability for writes in minority partition. (expected)
• No data loss (yet!) which is great!
Solr - Bugs, bugs & bugs
• SOLR-6530: Commits under network partition can put any node into ‘down’ state.
• SOLR-6583: Resuming connection with ZK causes log replay
• SOLR-6511: Request threads hang under network partition
• SOLR-7636: A flaky cluster status API - times out during partitions
• SOLR-7109: Indexing threads stuck under network partition can mark leader as down
Elastic
• Search server built on Lucene
• It has a Lucene index and a transaction log
• Consistent single doc reads, writes & updates
• Eventually consistent search but a flush/commit should ensure that changes are visible
Elastic
• Optimistic concurrency control, a.k.a. linearizable CAS
• Synchronous acknowledgement from a majority of nodes
• “Instantaneous” promotion under a partition
• Homegrown ‘ZenDisco’ consensus
Elastic - Are we safe?
• “Instantaneous” promotion is not: 90-second timeouts to elect a new primary (worse in <1.5.0)
• Bridge partition: 645/1961 writes acknowledged and lost in 1.1.0. Better in 1.5.0, only 22/897 lost.
• Isolated primaries: 209/947 updates lost
• Repeated pauses (simulating GC): 200/2143 updates lost
• Getting better but not quite there. Good documentation on resiliency problems.
MongoDB
• Document-oriented database
• Replica set has a single primary which accepts writes
• Primary asynchronously replicates writes to secondaries
• Replicas decide among themselves when to promote/demote primaries
• Applies to 2.4.3 and 2.6.7
MongoDB
• Claims atomic writes per document and consistent reads
• But strict consistency only when reading from primaries
• Eventual consistency when reading from secondaries
MongoDB - Are we safe?
Source: https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads
MongoDB - Are we really safe?
• Inconsistent reads are possible even with majority write concern
• Read-uncommitted isolation
• A minority partition will allow both stale reads and dirty reads
Conclusion
• Network communication is flaky! Plan for it.
• Hacker News-driven development (HDD) is not a good way to choose a data store!
• Test the guarantees of your data stores.
• Help me find more Solr bugs!
References
• Kyle Kingsbury’s posts on Jepsen: https://aphyr.com/tags/jepsen
• Solr & Jepsen: http://lucidworks.com/blog/call-maybe-solrcloud-jepsen-flaky-networks/
• Jepsen on GitHub: https://github.com/aphyr/jepsen
• Solr fork of Jepsen: https://github.com/LucidWorks/jepsen
Solr/Lucene Meetup on 25th July 2015
Venue: Target Corporation, Manyata Embassy Business Park
Time: 9:30am to 1pm
Talks:
Crux of eCommerce Search and Relevancy
Creating Search Analytics Dashboards
Signup at http://meetu.ps/2KnJHM