9.  Monitoring CloudTran-Coherence Transactions

9.1 RTView OC Monitor Screens

The CloudTran team uses the RTView monitoring tool from Sherrill-Lubinski (SL) to view the status of Coherence networks running CloudTran transactions. We have added some custom screens to the Oracle Coherence Monitor (OCM); what follows explains the installation and configuration required for the screens and what each screen tells you. It does not cover how to install RTView or OCM: it is assumed that you have already installed those products and configured OCM to monitor your cluster. Instructions can be found in the Enterprise RTView User Guide, currently at http://sldownloads.sl.com/docs/rtview/current/user/USERGUIDE.html, and in the Oracle Coherence Monitor User Guide.

 9.1.1  Installation
 9.1.2  Overview
 9.1.3  The Transaction Timing Statistics
 9.1.4  Dynamic Manager Configuration Screen
 9.1.5  Isolator Statistics

9.1.1  Installation

The RTView OCM home directory (RTVOC_HOME) has various subdirectories, including projects/myocm. Copy the screens, the files with the extension .rtv, into this directory. You also need to add the custom_navtree.xml file to the same directory unless you already have one; if you do, merge the XML contents into your current custom_navtree.xml.

There should also be an RTView installation on your machine. In the RTV_HOME directory there will be a lib subdirectory. The screens pull information through a JMX connection defined in JMXOPTIONS.ini in that lib directory. You need to add a line to the ini file defining the JMX connection called "Agent" that the screens expect; the two forms of the line are shown below. Change the host and the port number to wherever the JMX bean server on your network is.
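For example, the entry in JMXOPTIONS.ini takes one of the following two forms (the host and port shown are only the values from this example; substitute your own):

    For a direct connection to an OCM node:

        jmxconn Agent - - URL:local - - 'false'

    For a connection to an OCM Agent (here on localhost, port 9990):

        jmxconn Agent localhost 9990 URL:- - - 'false'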

The Overview Screen. This shows graphically transactions per second, persistence queues, disk write times and system commit times, all averaged or aggregated over the system as a whole. There's also a transaction totals section showing the number started, committed and persisted over the whole run so far.

The Transaction Timing Statistics Screen. The traces on the left hand side show the times of various phases of the transactions. As long as a consistent set of transactions is being run, you can expect to see fairly smooth operation times. All but the bottom trace on the right show pending values: the number of operations of each kind currently being performed. This will normally be quite small. The top two traces on either side are as viewed from the client, the rest from the manager. The bottom right hand trace shows the three mutually exclusive statuses of incomplete operations. If the traces show something going wrong, the worst client or worst manager nodes are shown at the top; the relevant client or manager can be selected and the traces will then apply to the selected node.

Dynamic Manager Screen. This allows dynamic configuration of some of the parameters affecting managers. If a particular manager needs to be optimised depending on the type of transactions it is running, that manager can be selected and its parameters changed. The traces on the right show the effect of the configuration changes.

Isolator Statistics. This screen shows up stress and problem areas for the isolator. If the system is running under capacity, the isolator screen will show zeros. If the database(s) can't keep up for substantial periods, the isolator will start to hold up transactions, and this will be reflected in some of the values on this screen. The values are totals over the run, so as well as watching the trending values on the traces, the screen can be reviewed after a test run.


9.1.2  Overview

The top left hand trace shows transactions per second; these are CloudTran distributed transactions. The blue trace is the number of transactions started per second and the yellow trace is the number completed. The yellow committed-per-second trace is smoothed over a few seconds to give a better indication of the trend without the peaks and troughs due to batching; this makes it lag behind the blue started-per-second trace.

The top right trace shows how many transactions are waiting to be persisted after they have been committed to the cache. The yellow trace shows those that are ready to be sent to the database, and includes those that are currently being sent and/or processed. The blue line shows those that are being held up, either because they are in conflict with another transaction and have to be held back for reasons of ordering, or because the database is at capacity and can't process anything further.

If transaction logging is switched on, which it needs to be for full transactional ACIDity, there is a delay in writing to the disk which ensures that transactions are batched and written to the disk, matched to the disk rotation speed, ideally in parallel with the distributed commit phase of the cache write. The faster the disk rotates, the smaller this delay should be. The delay in microseconds is shown here. If the transaction log is written to a solid state drive, this value should be set to 0. The choice of disk also matters: on a Linux test system, using a dedicated log disk gave a log latency of 10-20ms, whereas using the operating system's home/boot disk pushed it up to 90ms or more. The trace below shows the average length of time the logging write takes.
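As a rough illustration (the rotation speeds are assumptions for the sake of the arithmetic, not recommendations): a disk spinning at 7,200 RPM completes one rotation every 60/7,200 seconds, roughly 8.3 ms or about 8,300 microseconds, so a delay tuned to that rotation period would be of that order; a 15,000 RPM disk would halve it to about 4 ms.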

The last trace shows the time a transaction takes to do the distributed commit. Note that this is not the commit time as seen from the client, but the internal time of the commit within the cluster: it excludes communication time to and from the client and also the persistence time. The yellow trace is the average time in milliseconds. The blue trace is the maximum time, which is typically some multiple of the garbage collection time.

The numbers at the bottom are the total number of transactions that have gone into a particular stage. If the number of started transactions gets increasingly further away from the committed or persisted totals, this could indicate a problem.
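The screens pull these figures over the JMX connection configured during installation. If you want the same totals programmatically, a minimal sketch using the standard JMX client API might look like the following; the service URL, MBean object name and attribute names are hypothetical placeholders, so substitute whatever MBeans your cluster actually publishes.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class TransactionTotalsSample {
        public static void main(String[] args) throws Exception {
            // Adjust host and port to your JMX bean server (see the installation section).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9990/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Hypothetical object and attribute names, for illustration only.
                ObjectName stats = new ObjectName("CloudTran:type=TransactionStats");
                System.out.println("started   = " + mbs.getAttribute(stats, "TotalStarted"));
                System.out.println("committed = " + mbs.getAttribute(stats, "TotalCommitted"));
                System.out.println("persisted = " + mbs.getAttribute(stats, "TotalPersisted"));
            }
        }
    }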


9.1.3  The Transaction Timing Statistics

The operation times that run down the left of the screen show the times of various parts of a transaction. The top two are measured from the client node's viewpoint, the other four from the perspective of the managers. The first measure is the amount of time it takes for a client to start a new transaction. There is no measure on this screen of the time for the various operations before the commit; the next measure is the time it takes, from the client's perspective, to commit the transaction. Each manager has to request information from the central isolator before doing the distributed commit to ensure that transactions don't conflict; the third measure shown on the screen is the time it takes for the manager to make the request and get the isolator's response. The fourth measure is the time taken for the distributed commit into the various Coherence caches as seen from the transaction's controlling manager. Next is the time taken to make the call from the manager to permanently persist the data in the database. Finally, the time taken to write the transaction status to the local cache is shown. These times are not meant to add up to anything; rather they are representative times shown to help pinpoint whether something is going wrong. If there is a problem writing to the local cache it will show up in many of the traces, but if there is a communications issue between managers, the write-transaction-status time will be unaffected.

There are three lines on each graph: yellow is the average time, green is the minimum time and blue is the maximum time. All traces show the base 10 log of the times in milliseconds so you can see the trends more clearly (0 = 1ms, 1 = 10ms, 2 = 100ms and so on), and to the right are the values for the current refresh period in ms. The maxima and minima are those of all transactions handled in the last refresh period (by default one second). To be strictly accurate, the average value is the average of all transactions on each manager (or client), averaged across all managers (clients). Unless different managers or clients have very different transaction profiles, this shouldn't differ noticeably from an overall average.
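For example, a plotted value of 1.5 on one of these log-scale graphs corresponds to 10^1.5, roughly 32 ms (the figure is purely illustrative).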

The first five trend graphs on the right hand side show the number of transactions currently undergoing the operation shown on the graph to their left. So the first shows the number of transactions currently being started from client threads, and so on. The numbers should fluctuate but remain relatively low, provided the system is not running over capacity. If the numbers on these pending graphs climb continually, a backlog is building up in the operation concerned. The brown line shows the maximum number on any node and the yellow line the average number across the nodes. These don't show log values and, unlike the time graphs, they use automatic rescaling. To the right, these values for the current refresh period are shown along with the total number of transactions undergoing the operation on all nodes.

The bottom graph on the right hand side is not a pending operation graph. It shows the number of incomplete transactions on managers. A transaction that has been started but has had no commit request is called open; one that has had a commit request but has not yet finished committing to the Coherence cache is referred to as committing; a transaction that has been committed but has not yet been persisted in the database is called persisting. For the purposes of this graph, these terms are mutually exclusive. A transaction that is both committed and persisted is a completed transaction and is not shown here; however, there is a total of committed transactions shown at the top right of the screen.
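As an illustration of how the three incomplete states relate to each other, here is a small Java sketch derived from the definitions above; it is not CloudTran's actual API, just a model of the classification this graph uses.

    enum IncompleteState { OPEN, COMMITTING, PERSISTING, COMPLETED }

    final class TransactionStatus {
        boolean commitRequested;    // the client has asked for a commit
        boolean committedToCache;   // the distributed commit to the Coherence caches has finished
        boolean persisted;          // the data has been persisted to the database

        IncompleteState classify() {
            if (!commitRequested)  return IncompleteState.OPEN;        // started, no commit request yet
            if (!committedToCache) return IncompleteState.COMMITTING;  // commit requested, cache commit in progress
            if (!persisted)        return IncompleteState.PERSISTING;  // committed to cache, awaiting database persistence
            return IncompleteState.COMPLETED;                          // committed and persisted: not plotted on this graph
        }
    }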

If any of these trends shows a problem, for example if the maximum value diverges too far from the average, it is possible to see whether the issue is with one node or whether there is a systemic problem, by selecting a client or a manager node using the dropdown selectors at the top of the screen. If a client node is selected, the four graphs that show information from the client perspective will only show information for the selected client. Similarly if a manager node is selected the manager graphs will only show information for that manager.

You can click down through the nodes to see which, if any, is causing the problem, but to help you home in, the "worst node" ids are shown. The worst client node is the one that has the largest maximum time to commit. The three worst manager nodes are the ones that have the longest isolator request time, the longest distributed commit time and the longest write-info-to-cache time. A long database commit time is most likely a function of the database or of communication to the database, so narrowing it down to an individual manager is unlikely to be helpful. If you can't remember which is which, use the mouse-over tooltips.


9.1.4  Dynamic Manager Configuration Screen

This screen allows you to configure various parameters for a particular manager that will change that manager's performance. The first thing to do is select the manager whose parameters you want to alter, using the selector in the top left hand corner. This then shows the current state of the parameters down the left hand side and starts trace graphs of the major transactional information down the right. By altering the parameters on the left, the effect can be seen dynamically on the right.

The first parameter is the debug switch, true or false, which alters the level of diagnostic information that comes out in the logs. More diagnostic output has a negative effect on transaction processing.

The next two parameters switch transaction logging on or off. Database persistence is set to lag behind commits to the cache, and in order to maintain speed, the commit from the client completes before the persistence to the database does; so if a manager crashes, transaction information would be lost. Transaction logging is the mechanism that bridges the gap, using a temporary local write to disk storage which holds the information until the database persistence is completed. The transaction logging can occur before or after the cache commit; if set to after, it delays the return to the client. If transaction logging is switched off, this can speed up processing, but transactions may be lost if the cluster goes down.

Operation timers provide information comparable to that shown on these screens, as entries in the diagnostic logs. The Operation Timer switch turns that mechanism on or off.

The number of logger threads applies to the selected manager; the logging work is shared between the threads assigned to it.

Log write buffer time is the processing time for the transaction log write. It is typically better to tune MicrosPerLogwrite and leave this value unchanged.

Micros per log write is the amount of time spent between writes of the transaction log. Transactions are buffered and the write is timed to coincide with the disk rotation speed to get the fastest write at the outside of the disk.

The number of incomplete transactions should not be allowed to build up indefinitely, as accepting new transactions would make the system go even slower. So there is a maximum number of incomplete transactions allowed at any one time, beyond which a start-transaction call from the client will be refused.
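The screen itself changes these parameters over its JMX connection. As a sketch of an equivalent programmatic change, assuming the selected manager exposes the parameters as writable JMX attributes, the following might be used; the MBean object name and attribute names are hypothetical (MicrosPerLogwrite follows the spelling used above, but the real attribute name may differ).

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ManagerConfigSample {
        public static void main(String[] args) throws Exception {
            // Adjust host and port to your JMX bean server.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9990/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Hypothetical manager MBean name; substitute the one published in your cluster.
                ObjectName manager = new ObjectName("CloudTran:type=Manager,node=1");
                // Turn the debug switch off and retune the log write interval (values illustrative).
                mbs.setAttribute(manager, new Attribute("Debug", Boolean.FALSE));
                mbs.setAttribute(manager, new Attribute("MicrosPerLogwrite", 8333));
            }
        }
    }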

On the right hand side, the transactions per second chart shows the transactions started on the manager (blue trace) and committed on the manager (yellow trace).

Persistence queues (blue trace, second chart) are those waiting to be sent to the database. The database queue consists of those that have been sent to the database from the manager but have not yet been confirmed as committed on the database.

The write time to the transaction log is shown on the third chart (blue line), and the size of the queue waiting to be written is shown by the yellow line.

Maximum and average times to commit to the cluster are shown in the final chart, in milliseconds.


9.1.5  Isolator Statistics

There is only one active isolator; any others act only as backups, so the figures here come from the primary. All figures will remain at or close to zero when the system is running under capacity and nothing is going wrong, but the first two values will rise if the databases are running consistently more slowly than the Coherence caches.

Already Okay To Persist means that a manager asked the isolator whether a transaction was okay to persist, but it was already marked as okay to persist.

Already Requested means that a manager asked the isolator whether a transaction was okay to persist, but it had already been requested.

Different Object Now First: an object was requested to be unlocked at the isolator by one transaction, but another transaction was locking it. This is a very unusual case.

Invalid Object Type: The isolator has a record of a transactional object being of a different class to the object sent by the manager.

Object Gone: The isolator received a request to release an object, but it had already been unlocked.

The first row of values shows totals since the isolator was brought online, rather than instantaneous values. The second row shows the same figures as a percentage of the total number of committed transactions. At the bottom, the current committed transaction total is also shown.
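For example (with purely illustrative numbers), 50 Already Requested events against a total of 10,000 committed transactions would appear as 0.5% in the second row.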

Copyright (c) 2008-2013 CloudTran Inc.