HSM Synchronization: DINAMO Replication #2
To discuss the DINAMO Replication Layer (RL) implementation (see the last post), a stronger reasoning tool is needed. It’s time to talk about the CAP Theorem.
NETWORKED SHARED-DATA SYSTEMS
Ever since the CAP Theorem was first stated as a conjecture (Brewer, 2000), there have been many controversies surrounding it (to the point that Brewer wrote a follow-up, CAP Twelve Years Later: How the “Rules” Have Changed, clarifying important points).
Informally, we can state it as follows: given 3 desired properties called Consistency, Availability, and Partition Tolerance, a shared-data system can have at most 2 of them (sad, isn’t it?). The original proof, provided by Gilbert and Lynch in 2002, has more details.
Consistency would be equivalent to having a single, up-to-date copy of the data. Availability means that every request to that data (including updates) produces a response. Finally, Partition Tolerance is the ability to deal with network failures (e.g., lost messages and broken hardware). In that way, networked shared-data systems could be/have/support AP, CP, or CA (and what does it even mean to be CA?). But never CAP.
So much has been said about the CAP theorem, especially about the confusing P, that I won’t cover its details here (the CAP FAQ helps to settle some misconceptions). It suffices to acknowledge that its results have practical implications. In other words: despite marketing pitches for CA products, I’m on the side of those who believe that partition tolerance can’t be sacrificed.
If a failure event happens, and the network is partitioned from the point of view of the system participants (i.e., they can’t talk to each other), what will be preferred? Consistency, or Availability?
CP vs AP FOR AN HSM DISTRIBUTED OPERATION
HSMs are in the business of security/cryptography. In that sense, it doesn’t seem prudent to trade Consistency for Availability (a decision that is easier to grasp in split-brain scenarios).
If the network became partitioned, sub-partition replicated operations (like key creation and CSP destruction) would be subject to reconciliation. But what kind of conflict resolution could be applied? Last-write-wins? Merges?
Any adopted strategy should cope with the fact that keys are indivisible entities with strong identity semantics. Even if eventual consistency were attainable later (a debatable point), HSM users could be exposed to legal liabilities (e.g., Brazilian law recognizes digital signatures).
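A minimal sketch can make the danger concrete. The snippet below models last-write-wins reconciliation over key records; all names (`KeyRecord`, `lww_merge`) are illustrative assumptions, not DINAMO’s actual data model. Because key material is indivisible, LWW has no choice but to silently discard one side’s key:

```python
# Hypothetical illustration: why last-write-wins (LWW) reconciliation is
# unsafe for HSM keys. Names here are assumptions, not DINAMO's API.
from dataclasses import dataclass

@dataclass
class KeyRecord:
    key_id: str
    material: bytes   # opaque key material; indivisible, cannot be merged
    timestamp: int    # wall-clock write time used by LWW

def lww_merge(a: KeyRecord, b: KeyRecord) -> KeyRecord:
    # Keep the record with the newer timestamp; the loser is silently dropped.
    return a if a.timestamp >= b.timestamp else b

# During a partition, both sides create a key under the same identifier:
side_a = KeyRecord("signing-key-01", b"\x01" * 32, timestamp=100)
side_b = KeyRecord("signing-key-01", b"\x02" * 32, timestamp=101)

merged = lww_merge(side_a, side_b)
# side_a's material is gone after "reconciliation": anything signed with it
# can no longer be tied to the surviving key -- a legal-liability problem.
assert merged.material != side_a.material
```

Merging fares no better: there is no meaningful way to combine two distinct blobs of key material into one valid key.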
For DINAMO pools, the CP path was taken. This is a pessimistic approach, more in line with my security beliefs. In practice, it means that (some) availability was sacrificed in exchange for consistency. I believe this is a good trade-off for HSMs, if we take their operating contexts into account.
DINAMO RL, OR SMMRL
Technically speaking, DINAMO RL is a Synchronous Multi-master Replication (SMMR) implementation. In this system, each HSM can accept data-changing requests (like key creation/deletion, user updates, etc.). Any modified data is then transmitted from the originating node to every other pool participant before the distributed transaction commits.
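The all-or-nothing rule above can be sketched as a two-phase-commit-style write. This is a toy model under stated assumptions — the `Replica` class, `prepare`/`commit`/`abort` methods, and `replicated_write` helper are mine, not DINAMO’s interface — but it captures the CP behavior: a write either lands on every pool member or on none of them.

```python
# Toy sketch of synchronous multi-master replication: a write commits only
# after every pool member acknowledges it. Names are illustrative, not
# DINAMO's actual interface.

class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}      # committed data
        self.pending = {}    # prepared-but-uncommitted data

    def prepare(self, key, value):
        self.pending[key] = value
        return True          # a real node could refuse (unreachable, full, ...)

    def commit(self, key):
        self.store[key] = self.pending.pop(key)

    def abort(self, key):
        self.pending.pop(key, None)

def replicated_write(pool, key, value):
    """All-or-nothing write across the pool (CP choice)."""
    prepared = [r for r in pool if r.prepare(key, value)]
    if len(prepared) == len(pool):
        for r in pool:
            r.commit(key)
        return True
    for r in prepared:       # any refusal aborts the write everywhere
        r.abort(key)
    return False

pool = [Replica("hsm-1"), Replica("hsm-2"), Replica("hsm-3")]
ok = replicated_write(pool, "key-01", b"secret")
assert ok and all(r.store["key-01"] == b"secret" for r in pool)
```

Note how the availability sacrifice falls out of the protocol: if any participant cannot acknowledge (e.g., it sits across a partition), the write is refused rather than applied inconsistently.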
An HSM is usually employed in setups that produce more read workload than write workload. Canonical examples are digitally signing something with a private key, and EFT transaction validation. In both cases, key creation (a write) is a discrete event: it happens rarely compared to key usage (reads).
In fact, SMMR systems are not a good fit for heavy write activity, because excessive locking may lead to less-than-ideal performance. But in DINAMO’s case, an immediate benefit is that read requests can be sent to any HSM in a pool, speeding up the typical use cases.
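Since every pool member holds the same committed data, a client can spread reads cheaply. Here is a minimal round-robin client stub to make the idea concrete; `PoolClient` and its `read` method are assumptions for illustration, not a real DINAMO client:

```python
import itertools

# Sketch: with synchronous replication, any pool member is consistent, so
# read requests (e.g., sign operations) can go to any HSM. Round-robin is
# one trivial dispatch policy; names are illustrative.

class PoolClient:
    def __init__(self, nodes):
        self._rr = itertools.cycle(nodes)

    def read(self, request):
        node = next(self._rr)    # every replica is consistent: pick cheaply
        return node, request

client = PoolClient(["hsm-1", "hsm-2", "hsm-3"])
served = [client.read(f"sign-{i}")[0] for i in range(6)]
# Load spreads across the pool: each node serves two of the six reads.
assert served == ["hsm-1", "hsm-2", "hsm-3"] * 2
```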
Another huge advantage is that consistency is a built-in trait. No client-side logic is needed to handle non-deterministic outcomes.
Next stop: DINAMO SMMRL under-the-hood.