A K/V Store For In-Memory Analytics, Part 2

This is a continuation of a prior blog on the H2O K/V Store, Part 1.

A quick review of the key bits going into this next blog:

The code for all of this can be found in the H2O GIT repo: https://github.com/0xdata/h2o/blob/master/src/main/java/water

The Distributed K/V Store (hereafter, the DKV store) is similar to hardware cache-coherency protocols such as MESI, but altered to better match the software layer. In this hardware analogy, a Key is akin to an address in memory, a Value's payload is akin to the data stored at that address – and the Value itself is more like the hardware cache-coherency bits. These bits are not directly visible to the X86 programmer, but they have a profound impact on the (apparent) speed of memory, i.e. the cache behavior. As is typical for this kind of thing, the actual implementation is basically a distributed Finite State Machine.

Keys have a “Home Node” (pseudo-randomly assigned – the Home Node is different for each Key). Keys keep a little bit of state cached (mostly Home Node and a proper hash) but do not directly track coherency – that honor is given to the Values.
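As a rough illustration (not H2O's actual code), home-node selection can be pictured as hashing the Key's bytes once, caching the hash, and mapping it onto the cluster membership; KeySketch and homeNodeIndex below are made-up names:

    import java.util.Arrays;

    // Hypothetical sketch: hash the Key's bytes once, cache the hash, and
    // map it onto the cluster membership list. Every Node computes the same
    // answer without talking to any other cluster member.
    final class KeySketch {
      final byte[] bytes;  // the Key's name
      final int hash;      // cached hash, computed once
      KeySketch(byte[] bytes) {
        this.bytes = bytes;
        this.hash = Arrays.hashCode(bytes);
      }
      // Pseudo-random but deterministic choice of Home Node.
      int homeNodeIndex(int clusterSize) {
        return Math.floorMod(hash, clusterSize);
      }
    }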

On the Home Node, then, a Value tracks which other Nodes have replicas of this Value, and is responsible for invalidating those replicas when the Key is updated to a new Value. Values are also used to force ordering of writes to the same Key from the same non-home writer – one JVM can have only a single outstanding write to the same Key in flight at once (although there's no limit to the number of pending writes to unrelated Keys). Hence parallel writes to unrelated Keys all proceed in parallel – the limit being network bandwidth rather than network latency.
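One way to picture the "single outstanding write per Key per JVM" rule is a per-Key chain of pending writes; the PendingWrites class below is my own hypothetical sketch, not the H2O implementation, and removal of completed entries is elided:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch: serialize writes to the SAME Key from this JVM,
    // while writes to unrelated Keys live in independent entries and run in parallel.
    final class PendingWrites {
      private final ConcurrentHashMap<String, CompletableFuture<Void>> inFlight =
          new ConcurrentHashMap<>();

      // Chain this write behind any prior in-flight write to the same Key.
      CompletableFuture<Void> submit(String key, Runnable sendToHome) {
        return inFlight.compute(key, (k, prior) ->
            (prior == null ? CompletableFuture.<Void>completedFuture(null) : prior)
                .thenRunAsync(sendToHome));
      }
    }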

Each Value holds a single AtomicInteger for a read/writer-lock, a NonBlockingSetInt (really just a fast concurrent bitset), and a pointer to the user's payload. The R/W lock holds the count of pending Gets in-flight, or -1 if the Value is write-locked. The pending-Gets count is typically zero, only going up when a new cached replica is being set in some Node; once the remote Node has the replica (completes its Get) the pending-Gets count falls again. The bitset holds the set of Nodes replicating this Value. Gets are implemented in TaskGetKey.java, and are an instance of H2O's reliable RPC mechanism in RPC.java.
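In outline, the coherency bits on a home-node Value might be sketched as below; the field and method names are illustrative only, and a java.util concurrent set stands in for the NonBlockingSetInt:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Rough sketch of the coherency bits a home-node Value carries.
    final class ValueSketch {
      // R/W lock: count of pending Gets in flight, or -1 when write-locked.
      final AtomicInteger rwLock = new AtomicInteger(0);
      // Set of Node indexes holding a replica (H2O uses a NonBlockingSetInt,
      // essentially a fast concurrent bitset).
      final Set<Integer> replicas = ConcurrentHashMap.newKeySet();
      // The user's payload.
      final byte[] payload;
      ValueSketch(byte[] payload) { this.payload = payload; }

      // A new remote Get bumps the pending-Gets count, unless write-locked.
      boolean startGet() {
        for (int r = rwLock.get(); r >= 0; r = rwLock.get()) {
          if (rwLock.compareAndSet(r, r + 1)) return true;
        }
        return false;  // write-locked
      }
      // Once the remote Node has its replica, record it and drop the count.
      void finishGet(int nodeIdx) {
        replicas.add(nodeIdx);
        rwLock.decrementAndGet();
      }
    }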

On a non-home node, the Value's R/W lock field only holds a flag indicating that the Value is being overridden by a fresh local write, or not.

For a running example, let's assume a 4-node cluster. The cluster members are called A, B, C, and D. We have a single Key, “K1”, and start with a single Value, “V1”. When writing a Value, we'll follow it with the count in the R/W lock and a list of Nodes in the replica list. Example: “V1[2,A]” means Value V1 has 2 pending-Gets and is replicated on Node A.

A quiescent Value, with no Gets-in-flight and no replicas:

Actually, this is a 4-node cluster with K1 & V1 on Node A and not on the other nodes. Hence a better picture is:

Node B has K1 and does a Get. This misses in B's local K/V store (see DKV.java), so B does a remote Get (TaskGetKey.java). K1's hash (available directly from K1 without talking to any other cluster member) tells us that K1's home is Node A. Hence B fires off a TaskGetKey from B to A requesting a replica of K1. Here's the flow:

Note that once Node B receives V1, the thread in B waiting for it can immediately begin processing V1. The ACKACK is sent back asynchronously. The request for K1 fits in a UDP packet. Assuming reply V1 does as well, B only has to wait for a round-trip UDP packet send & receive before getting his data. Assuming Nodes C & D make the same request:

At this point any Node in the cluster can retrieve V1 from K1 at no more cost than a hashtable lookup.
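Pulling the read path together, a hedged sketch (with placeholder names like DKVSketch and remoteGet standing in for DKV.java and TaskGetKey.java) looks like:

    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of the read path described above.
    final class DKVSketch {
      private final ConcurrentHashMap<String, byte[]> localStore = new ConcurrentHashMap<>();
      private final int selfIdx, clusterSize;
      DKVSketch(int selfIdx, int clusterSize) { this.selfIdx = selfIdx; this.clusterSize = clusterSize; }

      byte[] get(String key) {
        byte[] cached = localStore.get(key);            // hit: just a hashtable lookup
        if (cached != null) return cached;
        int home = Math.floorMod(key.hashCode(), clusterSize);
        if (home == selfIdx) return null;               // home miss: no Value for this Key yet
        byte[] v = remoteGet(home, key);                // miss: one UDP round-trip to the home node
        if (v != null) localStore.put(key, v);          // cache the replica locally
        return v;
      }

      // Stand-in for TaskGetKey over H2O's reliable RPC; not the real call.
      private byte[] remoteGet(int homeNode, String key) { return null; }
    }

On a hit the cost is exactly the hashtable lookup mentioned above; only the first access from a given Node pays the UDP round-trip.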

What happens when C wants to write a new Value V2 to Key K1? C writes V2 locally, then sends V2 to A. A then resolves any conflicting writes (none here), and invalidates the copies B & D have. Future reads by B & D will then reload the updated value from A. Finally, C gets notification that his write has completed. Here's the flow:
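A similarly hedged sketch of the home node's side of that write – write-lock, invalidate the cached replicas, install the new Value, unlock – might look like the following; HomePutSketch and invalidateOn are placeholders, and waiting for the invalidate ACKs is elided:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical sketch of the home node's side of a put.
    final class HomePutSketch {
      final AtomicInteger rwLock = new AtomicInteger(0);           // pending Gets, or -1 = write-locked
      final Set<Integer> replicas = ConcurrentHashMap.newKeySet(); // Nodes caching the old Value
      byte[] payload;                                              // the user's payload

      void put(byte[] newPayload, int writerIdx) {
        // 1. Write-lock: block new Gets and wait for pending ones to drain.
        while (!rwLock.compareAndSet(0, -1)) Thread.yield();
        // 2. Invalidate every cached replica except the writer's own copy.
        for (int node : replicas) {
          if (node != writerIdx) invalidateOn(node);
        }
        replicas.clear();
        replicas.add(writerIdx);
        // 3. Install the new Value, unlock, and ACK the writer (ACK elided).
        payload = newPayload;
        rwLock.set(0);
      }

      // Stand-in for the invalidate RPC; not the real call.
      void invalidateOn(int nodeIdx) { }
    }

Note that the writer's own copy is left alone – only the other replicas are invalidated, matching the B & D invalidates above.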

Let's work through this diagram a little.

Let's talk a little about the performance here. What operations are fast, or slow – and why?

These common use cases do not block, and go at either network speeds or memory speeds:

We have yet to talk about:

But… I think I'm out of space!

Comments welcome as always,

Cliff
