10.3 Log File Replay
The CoordinatorPU holds a complete record of inflight transactions.
If the primary and all backups crash simultaneously, the record of inflight transactions is lost from memory.
(Let us stress that statisticians would argue that the chance of two machines failing at the same time is infinitesimally small,
but CIOs and operations staff will always have one war story about "that electrician...".)
10.3.1 What's Replay for?
After a crash, some transactions may not have been committed to the datastore(s) at all, or may have been only partially committed.
These are referred to as 'incomplete transactions', because the Coordinator has not done all its work on them.
Incomplete transactions are recreated by replaying them to the datastores.
This process is normally called 'transaction log [file] replay'.
We normally shorten that to 'replay'.
To make replay as fast as possible,
each transaction log file is disposed of once all the outstanding transactions logged in it have been persisted to the data stores.
This happens as a part of normal operations.
This means that, when CloudTran shuts down normally,
all transactions will have persisted and there are no transactions to replay.
When CloudTran starts up, it checks the files in the log directory.
If there are any incomplete transactions, CloudTran aborts the usual startup.
The error message indicates that the replay program must run before restarting the main application.
10.3.2 Running replay
The incomplete transactions must be cleared before restarting the main application, by running the 'replay' utility,
which is present in the deployment directory of the CoordinatorPU.
To run this, start a command line, change the current directory to the CoordinatorPU's deployment directory and run 'replay'.
The replay program is a standalone program that must be run while the main application is not running.
Do not attempt to start replay while the main application is running.
When the replay program starts, it uses the configuration properties from the CoordinatorPU (via its config.properties, and possibly additional properties).
This should be sufficient to pick up the actual log directory used.
If the log directory has in fact been specified on the command line, the same directory must be passed to the replay script.
Before replay is run, all the log files from all the manager node log directories should be copied to the actual log directory used.
(The log files at each node should be deleted after successful replay; the manager node will not start up while they are there.)
The default log file name pattern is as follows:
Logfile[DATE]-[TIME]_M[MEMBER_ID]_.log
where MEMBER_ID is the Coherence id for the node.
You start the log replayer with the following command:
java -cp CLASSPATH -Dct.logger.directory=LOG_DIR_PATH com.cloudtran.log.impl.LogReplayer
where CLASSPATH should be the same classpath used to start up the application
and LOG_DIR_PATH is the log directory path.
The end result of running replay is that the log files are disposed of.
If you use the default disposer, which deletes log files, you may want to take a backup before starting replay.
10.3.4 How Replay Finds Transactions To Persist
Replay works by checking remaining log files in the order they were created.
There are two types of entries in a log file:
- The committed entry - this is written after the transaction is committed at the transaction buffer manager.
- The persisted entry - this is written after the transaction is persisted. This entry is guaranteed to appear after the committed entry.
Entries are grouped together into blocks; each block is then padded out to a 4096 boundary
(by default - this can be changed).
If there are errors that are not detected by the disk hardware, they will almost certainly then be caught by the CRC check.
In this case, the action is the same as for a straight data error.
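The block padding and CRC check can be illustrated with a minimal sketch. It is not CloudTran's on-disk format: the class and method names are hypothetical, and it assumes that the 4096 boundary is in bytes, that the padding is zero bytes, and that the CRC is a CRC-32 over the unpadded payload.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical illustration only; the real log layout is not described in this section.
final class LogBlockSketch {
    // Default boundary from the text; assumed here to be a byte count.
    static final int BLOCK_SIZE = 4096;

    // Round a payload length up to the next multiple of BLOCK_SIZE.
    static int paddedLength(int payloadLength) {
        return ((payloadLength + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
    }

    // Copy the payload into a zero-padded block that ends on a block boundary.
    static byte[] pad(byte[] payload) {
        return Arrays.copyOf(payload, paddedLength(payload.length));
    }

    // CRC-32 over the first 'length' bytes (the unpadded payload).
    static long checksum(byte[] data, int length) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, length);
        return crc.getValue();
    }

    // True if the payload bytes still match the CRC recorded when the block was written.
    static boolean isIntact(byte[] block, int payloadLength, long storedCrc) {
        return checksum(block, payloadLength) == storedCrc;
    }

    public static void main(String[] args) {
        byte[] payload = "committed tx 42".getBytes(StandardCharsets.UTF_8);
        byte[] block = pad(payload);
        long crc = checksum(block, payload.length);
        System.out.println("block length: " + block.length);
        System.out.println("intact: " + isIntact(block, payload.length, crc));
        block[0] ^= 1; // simulate corruption that the disk hardware did not report
        System.out.println("intact after corruption: " + isIntact(block, payload.length, crc));
    }
}
```

A replayer built along these lines would treat a failed `isIntact` check in the same way as a read error reported by the disk.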
If errors are detected as above, you may decide for low-value transactions to ignore the failure and continue.
In that case, you can set the ct.replayer.continueAfterError property.
The replayer then does not stop immediately.
Instead, it will try to continue: in the case of a good read from the disk but a CRC failure caught by CloudTran,
this means the replay may well complete.
This happens just once: if there is another unrecoverable error, the replayer quits.
This is on the assumption that the problem is not localised and the likelihood of further errors is high, so it is unwise to carry on.
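The "continue once, then quit" rule can be sketched as follows. This is an illustration of the policy only, not the replayer's code: the class and method names are hypothetical, each block is reduced to a boolean saying whether it passed its checks, and the flag is passed in as a plain boolean because this section does not say how the property value is written.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the error-tolerance policy described above.
final class ContinueOnceSketch {
    /**
     * Walks the blocks in order and returns the index of the block at which
     * replay quits, or -1 if replay runs to completion.
     */
    static int firstFatalBlock(List<Boolean> blockIsGood, boolean continueAfterError) {
        // With the option set, exactly one unrecoverable error is tolerated.
        int errorsTolerated = continueAfterError ? 1 : 0;
        for (int i = 0; i < blockIsGood.size(); i++) {
            if (blockIsGood.get(i)) {
                continue; // a real replayer would process the block's entries here
            }
            if (errorsTolerated == 0) {
                return i; // no tolerance left: quit at this block
            }
            errorsTolerated--; // skip the bad block and carry on, once only
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Boolean> blocks = Arrays.asList(true, false, true, false, true);
        System.out.println("default: quits at block " + firstFatalBlock(blocks, false));
        System.out.println("continueAfterError: quits at block " + firstFatalBlock(blocks, true));
    }
}
```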
When no errors are detected, the replayer is interested in those transactions that have a committed entry and no persisted entry;
this is calculated across all extant log files.
- The case where there is a persisted entry but no committed entry is a normal situation.
Log files are kept until all committed entries noted in the log file have been persisted, so the file holding the committed entry may already have been disposed of.
The file holding the persisted entry is still present because it contains other committed entries that did not get persisted.
- Transactions with both committed and persisted entries across the log files are of no interest - we successfully sent them to the datastores.
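The selection rule above (committed but not persisted, calculated across all remaining log files) can be sketched as a set difference. The types below are hypothetical stand-ins: the real entry format and transaction identifier are not described here, so a transaction is reduced to a `long` id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; entry and id types are assumptions.
final class IncompleteTxSketch {
    enum Kind { COMMITTED, PERSISTED }

    static final class Entry {
        final long txId;
        final Kind kind;
        Entry(long txId, Kind kind) { this.txId = txId; this.kind = kind; }
    }

    /**
     * Returns the ids of transactions that have a committed entry and no
     * persisted entry in any file, in the order their commits were logged.
     */
    static Set<Long> incomplete(List<List<Entry>> logFilesInCreationOrder) {
        Set<Long> committed = new LinkedHashSet<>(); // keeps commit order
        Set<Long> persisted = new HashSet<>();
        for (List<Entry> file : logFilesInCreationOrder) {
            for (Entry e : file) {
                if (e.kind == Kind.COMMITTED) {
                    committed.add(e.txId);
                } else {
                    // May refer to a commit in a file that was already disposed of.
                    persisted.add(e.txId);
                }
            }
        }
        committed.removeAll(persisted); // what is left still needs replaying
        return committed;
    }

    public static void main(String[] args) {
        List<Entry> first = Arrays.asList(
            new Entry(1, Kind.COMMITTED), new Entry(2, Kind.COMMITTED), new Entry(1, Kind.PERSISTED));
        List<Entry> second = Arrays.asList(
            new Entry(5, Kind.PERSISTED), new Entry(3, Kind.COMMITTED),
            new Entry(4, Kind.COMMITTED), new Entry(3, Kind.PERSISTED));
        System.out.println("to replay: " + new ArrayList<>(incomplete(Arrays.asList(first, second))));
    }
}
```

Collecting both sets across every file before subtracting is what allows a persisted entry in a later file to cancel a committed entry in an earlier one.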
10.3.5 Committing Transactions During Replay
In general, the part of a transaction relating to any data store can be
- committed to the data store - we just didn't get the 'persisted' entry written to the log file before the crash
- not committed.
It is up to the logger to distinguish these cases.
When using relational databases,
UPDATE and DELETE operations will run without any error -
if they have already been committed to this datastore, the state will remain unchanged, which is fine.
For INSERT, however, a "constraint violation error" will occur when the transaction has been previously committed.
If we get this error, we consider that the transaction did successfully persist to this datastore.
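The INSERT rule can be sketched with standard JDBC exception types. This is an assumption about how such a check might look, not CloudTran's implementation: `SqlAction` and `replayInsert` are hypothetical names, and the sketch recognises a constraint violation either by the `SQLIntegrityConstraintViolationException` subclass or by an SQLSTATE in class 23, since drivers differ in which they report.

```java
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Hypothetical illustration of treating a constraint violation as "already persisted".
final class ReplayInsertSketch {
    // Stand-in for executing one replayed INSERT against a datastore.
    @FunctionalInterface
    interface SqlAction {
        void run() throws SQLException;
    }

    // SQLSTATE class 23 is the standard class for integrity constraint violations.
    static boolean isConstraintViolation(SQLException e) {
        if (e instanceof SQLIntegrityConstraintViolationException) {
            return true;
        }
        String state = e.getSQLState();
        return state != null && state.startsWith("23");
    }

    /**
     * Runs the INSERT. Returns true if this replay wrote the row, false if a
     * constraint violation shows it was committed before the crash.
     * Any other SQL error is rethrown to the caller.
     */
    static boolean replayInsert(SqlAction insert) throws SQLException {
        try {
            insert.run();
            return true;
        } catch (SQLException e) {
            if (isConstraintViolation(e)) {
                return false; // counted as successfully persisted
            }
            throw e;
        }
    }

    public static void main(String[] args) throws SQLException {
        System.out.println("fresh insert wrote row: " + replayInsert(() -> { }));
        System.out.println("duplicate insert wrote row: " + replayInsert(() -> {
            throw new SQLIntegrityConstraintViolationException("duplicate key");
        }));
    }
}
```

Note that a constraint violation can have causes other than a duplicate row, so a stricter check might also inspect which constraint failed.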
Just as for normal operation of CloudTran, replay will try continuously to reach a database if it is unavailable.
As the replay process is single-threaded, the whole process will pause once a needed database becomes unavailable.
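The blocking behaviour can be pictured as a simple retry loop on the single replay thread. The sketch below is illustrative only: the names are hypothetical and the fixed pause between attempts is an assumption, as the actual retry interval is not stated here.

```java
// Hypothetical illustration of retrying on one thread until the datastore responds.
final class RetryUntilAvailableSketch {
    // Stand-in for one attempt to apply a transaction to a datastore.
    interface Attempt {
        boolean tryOnce(); // true when the datastore accepted the work
    }

    /**
     * Retries on the calling thread until the attempt succeeds and returns the
     * number of attempts made. Nothing else progresses while this loop runs.
     */
    static int runUntilSuccess(Attempt attempt, long pauseMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (attempt.tryOnce()) {
                return attempts;
            }
            try {
                Thread.sleep(pauseMillis); // assumed fixed pause between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("replay interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated datastore that is unavailable for the first two attempts.
        int attempts = runUntilSuccess(() -> ++calls[0] >= 3, 10);
        System.out.println("succeeded after " + attempts + " attempts");
    }
}
```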
10.3.6 Disposing of Log Files
As soon as CloudTran has finished with a log file,
it disposes of it as described for normal running. In other words,
if you implement your own disposer, it will be used here; otherwise, the files will just be deleted.
10.3.7 How to clear the log directory manually?
To clear the log directory manually, just delete the log files from it.