Distributed Timestamps for MVCC

Timestamps are important for CloudTran MVCC (details on MVCC in the previous post). They are recorded at transactions managers (a) when a transaction starts and (b) when it commits.

As noted in that post, timestamps are typically used to get consistent reads: the start time of a reading transaction is compared to the commit time of the transactions that committed values into the cache. They can also be used for strict demarcation of transactions, as discussed later.


Time has to be coordinated across the grid: in other words, we need distributed time synchronization, encompassing many machines and JVMs. We are assuming the grid is on a LAN (not WANS). We’ll come back to WANs in a later post.

Out of the box, CloudTran provides two plug-ins for synchronizing time; you can add your own too (e.g. if you have a 1588/PTP driver).

The obvious method for time sync uses a singleton service that returns its local time (i.e. System.nanoTime() in Java). Time therefore runs at the rate of the crystal on the machine hosting the singleton service. In the case of failover, a replacement node takes over the service.

The singleton approach is easy to understand and gives you definitive ordering of events… but at the expense of two RPC calls per transaction. This adds significant latency – a couple of milliseconds per transaction on a GigE network – and also adds extra load to the machine running the singleton service, which can become a bottleneck and limit scalability.

The other approach is to estimate time on each transaction manager JVM (CloudTran normally runs many transaction managers). This polls the singleton service a few times a second but, rather than use the timestamps directly, interposes a service thread that deduces the global time. The main challenges in this calculation are:

  • differences in the frequencies of the crystals on the machines – 1-100 ppm (parts per million) accuracy are common between different types of computers
  • temperature variations, which changes the generated frequency. Although this effect is likely to be much smaller than the base frequency effect, it means you can never stop the service because the master frequency can change
  • the delays introduced by the networking stack, OS scheduling and Java GC’s.

The distributed time service works as follows:

  • get the timestamp ‘globalTime’ from the singleton service. We also get the local time before and after. The difference between the before and after times is the Round-Trip Time (RTT); the average of these two gives us ‘myTime’ on the local JVM. Putting them together, we have a {myTime, globalTime} data point.
  • by collecting lots of these {myTime, globalTime} points, we can deduce a straight-line graph
         globalTime = m * myTime + c
  • when we need to calculate a time, get myTime from nanoTime() and use the equation to calculate globalTime.

There are many details to getting an accurate estimate, of which the main ones are:

  • longer distance on the X axis reduces the variability (just like long wheel-base cars); 30 minutes is a typical collection length, but time is good enough for MVCC in a few minutes.
  • better estimates of time come from shorter round trips, so we filter out timings with longer roundtrips
  • if we did this filtering across the board, it would result in throwing away long sequences of readings, because roundtrip times go into quick and slow phases (i.e. get better or worse over 10’s of seconds). If these chopped readings were at the beginning (v. likely) or end, we reduce the X axis length and therefore accuracy. To avoid this, we break the graph into (e.g.) 10 pieces and do the filtering on each piece individually
  • we use a least-squares linear regression to get y=mx+c, using 128-bit arithmetic (those nanoseconds are important!)

The net result of this approach is that time can be served locally to an accuracy of c. +/-10 microsecs on GigE networks, or 20 microseconds variance between two nodes. The crucial thing for our discussion is that this is much less than the round-trip time (which, on our GigE network, within Coherence is greater than 700 microsecs). Between nodes on the same physical machine, the distributed time accuracy was +/-4 microsecs (or 8 microsecs variance) and the round-trip time was >200 microsecs.

Apart from their use in MVCC, these timestamps are useful for distributed logs because they are cheap (getting a time from the distributed service takes around 300ns on one core of a commodity Linux machine) and their accuracy (<< RTT time) means they can be used to deduce the order of events for a multi-node operation.

Application in MVCC

Distributed timestamps are good enough for use in MVCC. How can that be?

First, let’s address the easy one – timestamps for reading. How do we get a consistent view of some few objects, or indeed the whole grid, for example to get the units sold and the quantity to order for each product? Regardless of whether the transaction is distributed, the key to consistent reporting is to use transactions. They guarantee atomic and consistent state changes. If a particular set of objects is being changed at the time the read transaction starts, the new values should be used if they are committed before the start time; in other words, this boils down to ‘read committed’. For a grid-wide report, all the transactions must have completed; this can be guaranteed by starting a transaction with an exceptionally long timeout, waiting for the normal maximum transaction timeout, then reading the values.

For most business requirements, whether the actual timestamp from our distributed time server is highly accurate is not important – it is not directly related to the time of day anyway. The business will be more concerned that the report gets run, and then that it gets run on time. And consistency is served by having a “happens-before” metric – to decide whether the commit happened before the ask (read transaction) time. So our accuracy of a few microseconds is quite adequate for this.

Now let’s consider how to implement the update policy that prevents against write-write conflicts (e.g. eCommerce) or read-write conflicts (e.g. moving money around).

As noted in the previous post, there are two lists attached to each entry in the cache: a “pending list” holding the updates (and reads, sometimes) that may be committed by the transaction; and a committed list, with the new value and timestamp information for each commit.

CloudTran can immediately fail a transaction’s proposed update/read if another transaction already has a conflicting entry in the list – it doesn’t need timestamps to figure this out. But, when there are delays or long-running transactions and an overlapping transaction has already committed, we need to compare timestamps to check for conflict.

For example, here are some events (ordered by global time!) which interact at cache Entry X:

Start Transaction{A}
Read Entry X{A}
Start Transaction{B}
Update Entry X{B}
Commit Transaction{B}
Update Entry X{A}

If the policy for Transaction A prevents write-write conflicts, CloudTran must fail Update Entry X{A}, because the transactions overlap – Update Entry X{B} has occurred during transaction A. For example, in an eCommerce application, Entry X could represent all orders for a given customer. If we let Update Entry X{A} through, when Transaction A commits we will lose the changes from Transaction B (e.g. we’d lose the orders added by it).

The point of this example is that transaction B’s update will now be in the committed list, so CloudTran must also examine the timestamps of entries in the committed list to check for conflict. Diagrammatically, the conflict arises as follows:

Read Entry X{A} Update

So the check for a conflict in this case boils down to the test

     GlobalTime(Start Transaction{A}) < Global Time(Commit Transaction{B})

If the timestamps were inaccurate enough, we could have the situation where

     GlobalTime(Commit Transaction{B}) < Global Time(Start Transaction{A})

incorrectly implying the following order – the two transactions did not overlap:

Read Entry X{A} Update

So, how accurate do distributed timestamps have to be to guarantee that the conflict will be detected?

The answer is, much less than two roundtrip times. This is caused by the backups that need to be written for Start Transaction{A} and Update Entry X{B}. The detailed (tedious, even) derivation of this result is given below.

The case described above is the nastiest edge case, and gives quite a relaxed requirement on distributed time. Although this has been developed for CloudTran-Coherence, the general approach should be applicable to any LAN-connected system needing ordering/consistency where backups (and the delays they bring) can be assumed.

Derivation of “Two Roundtrip Times” result

We need to establish the minimum time between Time(Start Transaction{A}) and Time(Commit Transaction{B}), so we can be sure we don’t get these the wrong way round using distributed time across different nodes.

Start Transaction{A} Occurs on Tx A’s manager node, possibly (but not necessarily different) from manager B node and primary data cache node.
Timestamp taken at start.
Elapsed time: 1 RTT to backup.
Happens before Read Entry (because it needs the transaction).
Read Entry X{A} Occurs on the data node holding X.
Elapsed time may be negligible if on same node as client.
Must happens before the Update, because on Coherence accesses to the cache lock the entry; and after that the update is present.
(If we don’t do the Read at this point, then the read gets the updated value and we don’t have a conflict.
Update Entry X{B} Occurs on the data node holding X.
Happens after Read Entry X{A} (because the Read would fail if checking for write conflicts and UpdateEntry X{B} were present) and Read Entry X{B}. This orders transactions A and B.
Elapsed time: 1 RTT to backup.
Happens before Commit Transaction{B}.
Commit Transaction{B} Occurs on Tx B’s manager node.
Happens after Update Entry X{B}.
Elapsed time: negligible – timestamp taken at start of commit processing.

There will be other roundtrips in typical transactions, but for this analysis we’ve ignored them. All the steps here have a happens-before ordering, with an absolute minimum of 2 RTTs delay. These RTT’s will be between physical machines, because that is the safe deployment for backups. So, for distributed timestamps to cause us to deduce the wrong order, the inaccuracy would have to be greater than 2 RTTs. As the method has an accuracy of 1/25 or 1/30 of an RTT, this is impossible.

Comments are closed.