2019-11-22
Participants:
- Maxim Schuwalow (@mschuwalow)
- Peter Rowlands
- Dejan Mijic (@mijicd)
Agenda:
- Discuss Peter's design proposal.
Notes
We discussed a few things yesterday, but the high-level idea we agree on is still "95% CRDTs, 5% strongly consistent mutable data". We want to stay with our separate pluggable layers for membership/CRDTs and started compiling requirements and possible designs.
Membership
Here we agreed on the following requirements:
- We need to support a (possibly hierarchical) topic model for membership and broadcast.
- We need an algorithm that can operate in different environments depending on tunable parameters. The main axis we want to consider is reliability of nodes.
- We need to support a broadcast strategy that reaches all nodes in the cluster very quickly (for example through high fanout values), ideally with support for acknowledgements. This will be necessary for the consensus protocols.
- We should support some sort of immutable facts that can be spread across the cluster, such as certificate revocations.
- For all of { CRDTs, strongly consistent mutable data, immutable facts } we want to support a time to live.
- We want to have encryption and cryptographic signing built in from the beginning. Peter started posting about that.
I believe this gives us enough requirements to begin creating a short list of possible protocols that we could use for this. As a first step it is probably a good idea to create a long list of all well-known ones.
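To make the topic model and broadcast requirements a bit more concrete, here is a minimal sketch of what the pluggable membership layer could expose. All of the names (`Membership`, `Topic`, `broadcast`, `broadcastAcked`, ...) are placeholders for illustration, not an agreed-upon API.

```scala
import zio.UIO
import zio.stream.ZStream

// Placeholder types; the real representations are still to be designed.
final case class NodeId(value: String)
final case class Topic(path: List[String]) // hierarchical, e.g. List("cluster", "us-east")
final case class Message(payload: Array[Byte], ttl: Option[java.time.Duration])

sealed trait MembershipEvent
object MembershipEvent {
  final case class Joined(node: NodeId) extends MembershipEvent
  final case class Left(node: NodeId)   extends MembershipEvent
}

trait Membership {
  // Nodes currently known to participate in the given topic.
  def members(topic: Topic): UIO[Set[NodeId]]

  // Stream of join/leave events for a topic.
  def events(topic: Topic): ZStream[Any, Nothing, MembershipEvent]

  // Best-effort gossip broadcast; fanout is one of the tunable parameters.
  def broadcast(topic: Topic, message: Message, fanout: Int): UIO[Unit]

  // Fast broadcast with acknowledgements, intended for the consensus layer;
  // returns the set of nodes that acknowledged the message.
  def broadcastAcked(topic: Topic, message: Message): UIO[Set[NodeId]]
}
```

Separating a tunable best-effort broadcast from an acknowledged one mirrors the requirement that the consensus protocols need a fast path that reliably reaches all nodes.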
CRDTs
We decided to go with operation-based CRDTs by default, but not to overly specialize any parts of the code for them. We will need to support some sort of GC for them; Peter drafted how this might look:
- We have periodic snapshots that are signed and agreed on by the cluster using a consensus mechanism. These snapshots compact the metadata of a CRDT but do not change its current value. One example could be the removal of tombstones.
- We defined two periods related to GC:
- T1: the amount of time for which a snapshot is considered 'fresh'.
- T2: the amount of time for which a snapshot is retained; this should be significantly larger than the longest expected continuous downtime of a node.
- Each update to a CRDT is signed with the current snapshot. If a node sends an update, one of three situations can occur (see the sketch after this list):
- The age of the snapshot is < T1 -> the update is accepted without any further measures.
- The age of the snapshot is > T1, < T2 -> the member is required to update to a newer snapshot.
- The age of the snapshot is > T2 -> the member is forced to discard its own state and do a full sync with another member. This should only happen if a node has been down for a considerable amount of time.
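As an illustration of the three cases, here is a minimal sketch of the age check a receiving node could perform. `SnapshotGc`, `UpdateDecision` and `classify` are made-up names for this note, not a committed design.

```scala
import scala.concurrent.duration._

object SnapshotGc {

  // Possible outcomes when an update arrives signed with a snapshot of a given age.
  sealed trait UpdateDecision
  case object Accept             extends UpdateDecision // snapshot still fresh
  case object RequireNewSnapshot extends UpdateDecision // sender must catch up to a newer snapshot
  case object ForceFullResync    extends UpdateDecision // sender discards its state and syncs from scratch

  // t1: how long a snapshot counts as fresh; t2: how long it is retained at all (t1 < t2).
  def classify(snapshotAge: FiniteDuration, t1: FiniteDuration, t2: FiniteDuration): UpdateDecision =
    if (snapshotAge < t1) Accept
    else if (snapshotAge < t2) RequireNewSnapshot
    else ForceFullResync
}
```

The agreement on which snapshot is 'current', and its signature, would come from the consensus mechanism described in the next section.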
Consensus
We will require some sort of consensus mechanism both for GC and for any form of strongly consistent key-value store. Protocols like Raft tend not to scale well with the number of nodes in a cluster, so the proposal is the following (a rough structural sketch follows the list):
- We have one consensus protocol running that scales well with the number of nodes and elects a smaller group.
- This smaller group runs something like Raft and becomes the source of truth for consensus in the cluster.
- If one of the nodes in the inner circle goes down, it will be immediately replaced by a successor.
- Possibly, we have an active-passive setup with a group that will immediately take over leadership if the first group fails.
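To show roughly how the two tiers could fit together, here is a small structural sketch. `LeaderGroupElection` and `Consensus` are hypothetical names, and nothing here implies a concrete choice of the scalable election protocol.

```scala
import zio.UIO

final case class NodeId(value: String) // as in the membership sketch above

// Tier 1: a protocol that scales with cluster size; it elects a small inner group
// and keeps a list of standby successors (the possible active-passive setup).
trait LeaderGroupElection {
  def currentGroup: UIO[Set[NodeId]]
  def successors: UIO[List[NodeId]]
}

// Tier 2: a classic consensus protocol (e.g. Raft) running only inside the inner group,
// acting as the source of truth for GC snapshots and the strongly consistent KV store.
trait Consensus[Command] {
  // Returns true if the command was committed by the inner group.
  def propose(command: Command): UIO[Boolean]
}
```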