We’ve just released the GA version of CloudTran V2.0 for Oracle Coherence. The main feature of this release is MVCC (Multi-Version Concurrency Control), which is the focus of this post.
The new release also includes the following features, which will be discussed in later posts:
- distributed time synchronization in support of MVCC
- transactional replication between data centers
CloudTran is an add-on transaction management product for data grids (e.g. Coherence, GigaSpaces) that provides high-performance ACID transactions across arbitrary numbers of clients, servers and data values. The aim is to let applications that need consistency, durability or strong ordering semantics get them with minimal programming effort; this is achieved by providing a transactional wrapper for a data grid cache. CloudTran uses the underlying grid features to scale the transaction manager service, so the complete system (transactions plus processing/storage) can scale up and down to meet expected demand. Transaction managers also support failover, which avoids the “unresolved transaction” problem.
Transactions are optionally durable: if you want a cache-only transaction facility, CloudTran works quite happily, and if (maybe later in a product’s life) you want to add persistence of grid data to datastores, you can do so without changing existing code.
Until now, CloudTran has provided pessimistic locking, which allows multiple readers of committed values outside a transaction but only one active transaction per object. This supports two use cases:
- pessimistic writing, guaranteeing isolation and deadlock protection
- consistent reads (albeit with locking).
The problem with pessimistic locking is that entries must actually be locked. This slows down simple reads and, in some use cases such as financial risk, can cause an unacceptable percentage of transactions to fail due to lock conflicts.
MVCC transactions avoid these locking problems and are therefore preferred for use cases such as:
- reporting: read a set of values from different nodes as of a point in time
- volatile pricing: continuous update of hot values, like the price of a financial instrument or a bet
- mostly-read, one writer: prevent simultaneous updates of a given value, for example in eCommerce.
There is also a mode available to prevent the ‘missed update’ problem (aka write skew anomaly – see http://en.wikipedia.org/wiki/Snapshot_isolation) which should be used for moving money around. This is very similar to the strong isolation provided by pessimistic locking.
The current implementation does not maintain indexes that are aware of transactional values, so inconsistencies are possible. For example, a transactional update followed by a search using an index will be inconsistent. Similarly, starting a transaction and then doing a search will retrieve the value based on the last committed value, rather than the value as of the start time of the transaction. We have yet to see a compelling use case that justifies the additional overhead and complexity of supporting MVCC-aware indexes. (No doubt we will hear of one soon.)
The data to support MVCC is held in parallel with the entry in the data grid (e.g. in Coherence, using a decoration), meaning that non-transactional readers see the latest committed value. The transactional data for each entry under MVCC consists of
- a list of committed values and their commit time, so a transactional read can retrieve the committed value appropriate to the transaction (i.e. the newest value committed by another application, as long as it was committed before the start of this transaction)
- a list of pending updates, identifying the transaction and the update to perform. When the transaction is secured against missed updates, a ‘read’ entry is also placed into this list, to prevent it being written by another transaction.
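As a sketch of this structure (in Python, with illustrative field names; this is not CloudTran’s actual data format), the parallel MVCC data decorating one grid entry might look like:

```python
# Hypothetical shape of the MVCC decoration held alongside one grid entry.
# All names here are illustrative, not CloudTran's actual types or fields.
entry_decoration = {
    "committed": [                    # committed values and their commit times
        {"commit_time": 0.0, "value": "v1"},
        {"commit_time": 1.0, "value": "v2"},
    ],
    "pending": [                      # per-transaction pending updates
        {"txn_id": "T42", "new_value": "v3", "is_read": False},
        # in missed-update-protected mode, a read would add an entry
        # with is_read=True to block writers from other transactions
    ],
}
```

Non-transactional readers never look at this decoration; they simply see the entry’s latest committed value.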
For a transactional read, CloudTran first examines the updates list to see if there is an existing value for this transaction; this means that multiple clients can use the cache to coordinate. If there is no update for this transaction, CloudTran uses the list of committed values and returns the latest one committed before the reading transaction started; this provides snapshot isolation on reads. For example, if values were committed at 11:00, 11:01 and 11:03 and the reading transaction started at 11:02, then the 11:01 value is returned. Finally, if the parallel information is missing or does not yield an appropriate value, the non-transactional value is returned.
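The three-step read path above can be sketched as follows (a minimal Python illustration using the hypothetical decoration shape, not CloudTran’s actual code):

```python
def transactional_read(decoration, txn_id, txn_start_time, non_txn_value):
    # 1. This transaction's own pending update wins: "read your own writes".
    for upd in decoration["pending"]:
        if upd["txn_id"] == txn_id and not upd.get("is_read", False):
            return upd["new_value"]
    # 2. Otherwise, the newest value committed before this transaction started.
    earlier = [c for c in decoration["committed"]
               if c["commit_time"] < txn_start_time]
    if earlier:
        return max(earlier, key=lambda c: c["commit_time"])["value"]
    # 3. Finally, fall back to the plain non-transactional value.
    return non_txn_value

# The example from the text: commits at 11:00, 11:01 and 11:03 (as minutes);
# a transaction that started at 11:02 sees the 11:01 value.
decoration = {
    "committed": [{"commit_time": 0, "value": "v@11:00"},
                  {"commit_time": 1, "value": "v@11:01"},
                  {"commit_time": 3, "value": "v@11:03"}],
    "pending": [],
}
print(transactional_read(decoration, "T1", 2, "plain"))  # → v@11:01
```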
Therefore, “read your own writes” is guaranteed, and a client never sees another transaction’s dirty value.
For an update, the transaction manager applies the current update policy (e.g. allow concurrent updates) and if the update is allowed, adds a pending update. Note that this approach means that writing the new value is combined with the Prepare of two-phase commit: by the time all the updates have successfully been done, the only thing left to do is to commit.
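A sketch of this update step, under the same illustrative decoration shape (the policy flag and names are assumptions, not CloudTran’s API):

```python
def try_update(decoration, txn_id, new_value, allow_concurrent=True):
    # Apply the current update policy: with allow_concurrent=False, any other
    # transaction's pending update (or protected 'read' entry) blocks us.
    others = [u for u in decoration["pending"] if u["txn_id"] != txn_id]
    if not allow_concurrent and others:
        return False
    # Record the pending update, replacing any earlier one by this
    # transaction. This doubles as the 'Prepare' of two-phase commit: once
    # every changed value has its pending update in place, only Commit remains.
    decoration["pending"] = [u for u in decoration["pending"]
                             if u["txn_id"] != txn_id]
    decoration["pending"].append(
        {"txn_id": txn_id, "new_value": new_value, "is_read": False})
    return True

d = {"committed": [], "pending": []}
print(try_update(d, "T1", "x"))                          # → True
print(try_update(d, "T2", "y", allow_concurrent=False))  # → False
```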
It is the transaction manager’s responsibility to
- remove pending updates for a transaction if the transaction aborts and
- abort transactions if a client fails.
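The abort side of this responsibility is straightforward; as a sketch (illustrative names, same hypothetical decoration shape as above):

```python
def abort_transaction(decoration, txn_id):
    # Discard the aborted transaction's pending updates (including any
    # protected 'read' entries) so they can never be committed.
    decoration["pending"] = [u for u in decoration["pending"]
                             if u["txn_id"] != txn_id]

d = {"committed": [],
     "pending": [{"txn_id": "T1", "new_value": "x", "is_read": False},
                 {"txn_id": "T2", "new_value": "y", "is_read": False}]}
abort_transaction(d, "T1")
print([u["txn_id"] for u in d["pending"]])  # → ['T2']
```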
If a node fails while managing a transaction, another node will become the owner of that transaction and take responsibility for completing the transaction lifecycle (unless the whole grid crashes – in which case we have bigger problems). This avoids the problem of pending updates never getting resolved because a server machine crashed.
The parallel MVCC data needs to be garbage collected. As described so far, committed values would simply accumulate in the list and never be deleted; in fact, a committed value is only useful until the longest-lived transaction that could read it has timed out. While there are ways to determine this precisely, for now we take a pragmatic approach: a grid-wide definition of the maximum allowable transaction time. Obsolete committed values can then be garbage-collected on every access to the MVCC information.
Pending updates also need garbage collection, in case a client crashes after placing updates directly in the cache but before committing or aborting the transaction. Every transaction has a timeout, recorded in the transaction information in the update list; this timeout is used to lazily garbage-collect obsolete updates.
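Both garbage-collection rules can be sketched together (a Python illustration under the hypothetical decoration shape; the `timeout_at` field and function name are assumptions):

```python
def gc_mvcc(decoration, now, max_txn_time):
    # No live transaction can have started before this horizon.
    horizon = now - max_txn_time
    committed = sorted(decoration["committed"], key=lambda c: c["commit_time"])
    # Keep everything committed after the horizon, plus the single newest
    # value at or before it (still visible to the oldest possible reader).
    recent = [c for c in committed if c["commit_time"] > horizon]
    older = [c for c in committed if c["commit_time"] <= horizon]
    decoration["committed"] = older[-1:] + recent
    # Drop pending updates whose transaction has passed its timeout.
    decoration["pending"] = [u for u in decoration["pending"]
                             if u.get("timeout_at", float("inf")) > now]

d = {"committed": [{"commit_time": t, "value": f"v{t}"} for t in (0, 1, 3, 9)],
     "pending": [{"txn_id": "T1", "new_value": "x", "timeout_at": 5}]}
gc_mvcc(d, now=10, max_txn_time=6)
print([c["commit_time"] for c in d["committed"]])  # → [3, 9]
print(d["pending"])                                # → []
```

Because this runs lazily on access, no background sweeper is needed: entries that are never touched again cost nothing to clean.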
Finally, a word on the overhead of using transactions. Distributed transactions have a reputation for being slow because of the overhead of two-phase commit. CloudTran minimises this overhead by being a member of the grid, so it can take advantage of the grid topology and the operations being performed. We mentioned above the technique of combining the ‘Prepare’ with the cache write; this reduces the overhead per changed value to one backed-up write (i.e. to do the ‘Commit’). The other aspect to consider is the cost of calls to the manager to start, commit and complete the transaction; CloudTran can also reduce (but not entirely eliminate) the overhead here, given the appropriate deployment structure.