gryadka v1.61.10


Gryadka is a minimalistic Paxos-based master-master replicated consistent key/value layer on top of multiple instances of Redis. With 2N+1 Redis instances, Gryadka keeps working when up to N of them become unavailable.

Its core has less than 500 lines of code but provides a full-featured Paxos implementation supporting such advanced features as cluster membership change (the ability to add/remove nodes to/from a cluster) and the distinguished proposer optimization (using one round trip to change a value instead of two).

Why

Paxos is a master-master replication protocol. Its inventor, Leslie Lamport, wrote that "it is among the simplest and most obvious of distributed algorithms", yet many who tried to implement it ran into trouble.

This dissonance made me wonder, so I challenged myself to write a simple Paxos implementation. I took lines of code as a measure of simplicity and set a limit of 500 lines.

Examples

Please see the https://github.com/gryadka/js-example repository for an example of a web API built on top of Gryadka and its end-to-end consistency testing.

FAQ

How does it differ from Redis Cluster?

Redis Cluster is responsible for sharding and replication, while Gryadka pushes sharding to a layer above and focuses only on replication.

Both systems execute a request only when a node executing the request is "connected" to the majority of the cluster. The difference is in the latency/safety trade-off decisions.

Even in the best case Gryadka requires two consecutive round trips: one between a client and a proposer, another between the proposer and the acceptors. Meanwhile Redis uses asynchronous replication and requires just one round trip: between a client and the master.

But the better latency has its price; Redis's docs warn us: "Redis Cluster is not able to guarantee strong consistency. In practical terms this means that under certain conditions it is possible that Redis Cluster will lose writes that were acknowledged by the system to the client."

Gryadka uses Paxos-based master-master replication, so lost writes and other consistency issues are impossible by design.

Is it production ready?

No, it's an educational project and was never intended to be used in production. It was created:

  • to practice and hone skills in distributed systems
  • to demonstrate that Paxos isn't as complex as it is reputed to be

Nevertheless, Gryadka supports cluster membership change and the distinguished proposer optimization, so it has all the features necessary for production use.

Does it support storages other than Redis?

No. A pluggable storage layer would have been of the same order of magnitude in size as Gryadka's current Paxos implementation, so only Redis is supported.

The good news is that the code is tiny, so it should be easy to read, to understand and to rewrite it for any storage of your choice.

I heard that Raft is simpler than Paxos, why don't you use it?

Raft is a protocol for building a replicated, consistent, persistent append-only log. Paxos has several flavors: Multi-decree Paxos does the same as Raft (a log), while Single-decree Paxos replicates an atomic variable.

Yes, Raft looks simpler than Multi-decree Paxos, but Single-decree Paxos is simpler than Raft because all its updates happen in place and you don't need to implement log truncation and snapshotting.

Of course a replicated log is a more powerful data structure than a replicated variable, but in a lot of cases the latter is enough. For example, a key/value storage can be built with just a set of replicated variables.

Why did you choose JavaScript and Redis?

Gryadka is an educational project so I chose the most popular language on GitHub and the most popular key/value storage (according to db-engines).

The project looks awesome, do you need any help?

Sure. The more people who can understand Gryadka, the better. One way to achieve that is to keep the code as simple as possible; another is to eliminate other barriers:

  • by porting Gryadka from JavaScript to other programming languages
  • by translating README from English to other human languages

Design principle

The main principle of Gryadka is to get rid of everything that isn't essential to replication and everything that can be implemented on the client side. A lot of things that look essential to replication can actually be implemented in a layer above, among them transactions, sharding, consistent backup and leader election.

Transactions

There are a lot of papers, articles and libraries covering or building client-side transactions supporting isolation levels from Read Committed to Serializable.

It might also be useful to take a look at "Visualization of RAMP transactions" and "Visualization of serializable cross shard client-side transactions".

Consistent backups

The ability to make consistent backups (aka point-in-time backups, consistent cuts/snapshots) looks like an essential feature for a consistent storage, but many major storages don't support it.

"MongoDB's docs": "On a running production system, you can only capture an approximation of point-in-time snapshot."

"Cassandra's docs": "To take a global snapshot, run the nodetool snapshot command using a parallel ssh utility ... This provides an eventually consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time a snapshot is taken"

"Riak's docs": "backups can become slightly inconsistent from node to node"

Fortunately, consistent backups can be implemented on the client side. If a system is based on the actor model and a key/value storage is used only to keep the actors' state, then it's possible to use Lai–Yang's algorithm or Mattern's algorithm to make consistent snapshots.

Another approach to snapshotting is to back up each key independently, to keep the version of the last backup for each key along with its value, and to reject any transaction that touches keys with different backup versions.
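
A rough sketch of this second approach, assuming the changeQuery interface described in the API section below (the backupVer field and the helper names here are illustrative, not part of Gryadka):

class BackupAwareKeyValue {
  constructor(paxos) {
    this.paxos = paxos;
  }
  markBackedUp(key, backupVer) {
    // stamp the key with the version of the backup it belongs to and return
    // the value so it can be copied to the backup storage
    return this.paxos.changeQuery(key, x => ({ ...x, backupVer: backupVer }), x => x);
  }
  assertSameBackupGeneration(values) {
    // reject a client-side transaction if it touches keys that were
    // backed up at different versions
    const versions = new Set(values.map(v => v.backupVer));
    if (versions.size > 1) throw new Error('keys span different backup generations');
  }
}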

Leader election

A naive Paxos implementation uses two round trips between proposers and acceptors to commit a value. Of course, a proposer can piggyback the next 'prepare' message on the current 'accept' message. This effectively reduces the number of round trips from two to one if the next update is issued by the same proposer (otherwise nothing bad happens, because Paxos stays consistent in the presence of concurrent proposers).

So the problem of leader election reduces to the problem of routing most of the user updates for a key to the same node. It can be solved in a layer above with Microsoft Orleans, Uber RingPop or any other consistent-hashing-based routing approach.
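
A minimal sketch of such routing, using rendezvous hashing (one possible technique; the endpoint list and hashing choice below are illustrative):

// Pick the same endpoint for the same key as long as the endpoint list is stable,
// so most updates of a key land on the same proposer.
const crypto = require('crypto');

function pickEndpoint(key, endpoints) {
  // rendezvous (highest random weight) hashing
  let best = null;
  let bestScore = null;
  for (const endpoint of endpoints) {
    const score = crypto.createHash('sha1').update(key + '/' + endpoint).digest('hex');
    if (bestScore === null || score > bestScore) {
      best = endpoint;
      bestScore = score;
    }
  }
  return best;
}

// pickEndpoint('user:42', ['10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080'])
// always returns the same endpoint until the endpoint list changes.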

Sharding

Sharding is a way to split a big key space into disjoint smaller key spaces and to host each of them on its own instance of the system in order to overcome size limitations. The procedure of splitting and joining key spaces should not affect the correctness of concurrent key update operations.

The straightforward approach is to use a transaction to simultaneously put a tombstone on a key in the big key space and to initialize the smaller key space with the tombstoned value. Once all the keys are migrated and all the clients have switched to the new key space topology, it's safe to drop the tombstoned key/values from the original key space.
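
A simplified, non-transactional sketch of migrating a single key, assuming the changeQuery interface described in the API section below (oldPaxos/newPaxos and the tombstone shape are illustrative):

class ShardMigrator {
  constructor(oldPaxos, newPaxos) {
    this.oldPaxos = oldPaxos; // instance hosting the original big key space
    this.newPaxos = newPaxos; // instance hosting the new smaller key space
  }
  migrate(key) {
    // 1. tombstone the key in the original key space and capture its value
    const moved = this.oldPaxos.changeQuery(
      key,
      x => ({ tombstone: true, val: x == null ? null : x.val }),
      x => x
    );
    // 2. initialize the key in the new key space with the tombstoned value,
    //    unless someone has already written it there
    return this.newPaxos.changeQuery(
      key,
      x => (x == null ? { tombstone: false, val: moved.val } : x),
      x => x
    );
  }
}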

So sharding, too, can be pushed to the client side.

API

Gryadka's core interface is trivial. It's a changeQuery function which takes three arguments:

  • a key
  • a change function
  • a query function

Internally changeQuery gets a value associated with the key, applies change to calculate a new value, saves it back and returns query applied to that new value.

The pseudo-code:

class Paxos {
  constructor() {
    // in-memory stand-in; the real implementation replicates the value
    // with Single-decree Paxos over the Redis acceptors
    this.storage = new Map();
  }
  changeQuery(key, change, query) {
    const value = change(this.storage.get(key));
    this.storage.set(key, value);
    return query(value);
  }
}

By choosing appropriate change/query functions it's possible to customize Gryadka to fulfill different tasks. A "last write wins" key/value storage could be implemented as:

class LWWKeyValue {
  constructor(paxos) {
    this.paxos = paxos;
  }
  read(key) {
    // keep the value as is and return it
    return this.paxos.changeQuery(key, x => x, x => x);
  }
  write(key, value) {
    // unconditionally overwrite the value (last write wins)
    return this.paxos.changeQuery(key, x => value, x => x);
  }
}

A key/value storage with compare-and-set support may look like:

class CASKeyValue {
  constructor(paxos) {
    this.paxos = paxos;
  }
  read(key) {
    // initialize a missing key with version 0, otherwise return it unchanged
    return this.paxos.changeQuery(key, x => x == null ? { ver: 0, val: null } : x, x => x);
  }
  write(key, ver, val) {
    return this.paxos.changeQuery(key, x => {
      // reject the update if the key was changed since it was read
      if (x.ver != ver) throw new Error('version conflict');
      return { ver: ver + 1, val: val };
    }, x => x);
  }
}
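
For example, a caller can build a retry loop on top of CASKeyValue; the increment below is just an illustration, and it assumes changeQuery returns (or resolves to) the queried value and throws on a version conflict:

async function increment(kv, key) {
  while (true) {
    const current = await kv.read(key);
    try {
      // succeeds only if nobody changed the key between the read and the write
      return await kv.write(key, current.ver, (current.val || 0) + 1);
    } catch (e) {
      // a concurrent update won the race: re-read and retry
    }
  }
}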

Network

Gryadka exposes its API via an HTTP interface. It's problematic to pass functions over the network, so users should put the functions on the server and pass the names of the functions instead. See the src/webapi/mutators folder.
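
A hypothetical example of such a named mutator (the actual registration format is defined by the code in src/webapi/mutators and may differ from this sketch):

// inc.js (hypothetical): a named change function kept on the server;
// clients send the name "inc" and its arguments instead of the function itself
module.exports = {
  name: 'inc',
  change: args => x => {
    const current = x == null ? { ver: 0, val: 0 } : x;
    return { ver: current.ver + 1, val: current.val + args.by };
  },
  query: () => x => x
};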

The system is distributed and homogeneous, so it has several endpoints and all of them are equal. A user can choose any of them to invoke the changeQuery API; however, if all the requests affecting the same key land on the same endpoint, the distinguished proposer optimization kicks in and the requests run twice as fast.

Gryadka is based on remote interactions. Remote interactions differ significantly from local ones: instead of two possible outcomes an operation has three: 'success', 'failure' and 'unknown'. The latter may be returned when an operation times out and the true outcome is unknown.

The result of changeQuery reflects all those possibilities.
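
A minimal sketch of how a client might treat the 'unknown' outcome, re-reading before deciding whether to retry (the isTimeout flag and the version check below are illustrative assumptions, not Gryadka's actual error shape):

async function writeWithRecovery(kv, key, ver, val) {
  try {
    return await kv.write(key, ver, val);
  } catch (e) {
    if (e.isTimeout) {
      // 'unknown': the write may or may not have been applied, so re-read
      const current = await kv.read(key);
      if (current.ver === ver + 1 && JSON.stringify(current.val) === JSON.stringify(val)) {
        return current; // the write most likely went through
      }
    }
    throw e; // a genuine failure or a concurrent update; the caller retries from a fresh read
  }
}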

Consistency

A consistent system is one that does not contain a contradiction. Contradictions happen when a system breaks its promises. Different storages provide different promises so there are different types of consistency: eventual, weak, causal, strong and others.

Gryadka promises linearizability, so it is a strongly consistent data storage (provided, of course, that it keeps that promise).

Intuitively, linearizability is very close to thread safety: a system composed of a linearizable key/value storage and its clients behaves the same way as a thread-safe hashtable and a set of threads working with it. The only difference is that a linearizable key/value storage is usually replicated and tolerates network issues and node crashes without violating its consistency guarantees.

It isn't easy to keep promises; for example, Kyle Kingsbury demonstrates in his research that many commercial data storages had consistency issues, among them VoltDB, Cassandra, MongoDB and others.

This is the reason why Gryadka was built with consistency testing in mind. Its code-to-test ratio is 1:5, so for 500 lines of Paxos there are 2500 lines of tests.

Theory

Tests can prove that a program has errors, but they can't guarantee correctness. The way to go is to write a program based on a validated model. One can use a formal specification language like TLA+ to describe a model and then check it with a model checker; alternatively, an algorithm (a model) can be proved by hand using logic, induction and the rest of the math arsenal.

Gryadka uses Single-decree Paxos (Synod) to implement a rewritable register. The write-once variant of Synod is proved in the Paxos Made Simple paper. The rewritable variant is its extension; I bet there is a paper describing it, but I failed to find it, so I practiced logic and proved it in this post.

Cluster membership change

The proof easily extends to read and write quorums of different sizes, which is consistent with the result of Flexible Paxos: Quorum intersection revisited. This idea can be combined with Raft's joint consensus to demonstrate that a simple sequence of steps changes the size of a cluster without violating consistency.

Simulated network, mocked Redis, fault injections

Testing is done by mocking the network layer and checking consistency invariants during various network fault injections such as message dropping and message reordering.

Each test scenario uses seedable randomization, so all of a test's random decisions are determined by its initial value (the seed), and a user can replay any test and expect the same outcome.
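
The idea can be illustrated with any tiny seedable PRNG, e.g. mulberry32 (shown here only as an illustration; it is not necessarily the generator the tests use):

// mulberry32: a small deterministic PRNG; the same seed always produces the
// same sequence of "random" decisions, which is what makes a test replayable
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const random = mulberry32(42);
const shouldDropMessage = () => random() < 0.1; // identical decisions on every replay of seed 42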

Invariants

The following situation is one example of a consistency violation:

  1. Alice reads a value
  2. Alice tells Bob the observed value via an out of the system channel (a rumor)
  3. Bob reads a value but the system returns a value which is older than the rumor

This gives a hint at how to check linearizability:

  • Tests check a system similar to CASKeyValue
  • All clients are homogeneous and execute the following loop
    1. Read a value
    2. Change it
    3. Write it back
  • Clients run concurrently in the same process and spread rumors instantly after each read or write operation
  • Once a client observes a value (through a read or write operation), it checks that the value is equal to or newer than the one known through rumors at the moment the operation started (see the sketch below)
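
A minimal sketch of that client loop (the rumors map and the value shape are illustrative, not the actual test code):

// shared in-process "rumor" board: the newest version each client has seen per key
const rumors = new Map();

async function clientLoop(kv, key) {
  while (true) {
    const knownBefore = rumors.get(key) || 0;
    const current = await kv.read(key);
    // the system must never return something older than the rumor known
    // at the moment the operation started
    if (current.ver < knownBefore) throw new Error('linearizability violation');
    rumors.set(key, Math.max(rumors.get(key) || 0, current.ver));
    try {
      const written = await kv.write(key, current.ver, (current.val || 0) + 1);
      rumors.set(key, Math.max(rumors.get(key) || 0, written.ver));
    } catch (e) {
      // a concurrent update or an injected fault: loop and retry
    }
  }
}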

This procedure has already helped to find a couple of consistency bugs, so it works :)

To prevent the consistency check from degrading into a return true; stub, there is the losing/c2p2k1.i test, which runs the consistency check against an a priori inconsistent Paxos configuration (three acceptors with quorums of size 1).

How to run consistency tests:

Prerequisites: node-nightly installed globally, see https://www.npmjs.com/package/node-nightly for details

  1. Clone this repo
  2. cd gryadka
  3. npm install
  4. ./bin/run-consistency-tests.sh all void seed1

Instead of "void" one can use "record" to record all events fired in the system during a simulation after it. Another alternative is "replay" - it executes the tests and compares current events with previously written events (it was useful to check determinism of a simulation).

It takes time to execute all the test cases, so run-consistency-tests.sh also supports running a particular test case: just replace "all" with the test's name. Run ./bin/run-consistency-tests.sh without arguments to see which tests are supported.

Test scenarios

Tests are located in the tests/consistency/scenarios folder and are grouped into 4 scenarios. All the tests follow the same naming convention, so if you see a test named, say, c2p2k1, you can be sure it checks consistency with two concurrent clients (c2) and two proposers (p2) working with the same key (k1).

Shuffling

Shuffling tests check consistency in the presence of arbitrary network delays, covering the cases when the order of the received messages doesn't match the order of the sent messages.

Losing

Losing tests deal with arbitrary network delays and arbitrary message loss.

Partitioning

Partitioning tests are about arbitrary network delays, arbitrary message loss and continuous connectivity issues between proposers and acceptors.

Membership

Membership tests check that consistency holds in the presence of arbitrary network delays and arbitrary message loss while a cluster is being extended or shrunk.

  • c2p2k2.a3.a4: tests the process of migration from 3 acceptors to 4 acceptors
  • c2p2k2.flux: tests the process of continuously extending/shrinking a cluster between 3 and 4 acceptors