TDIing out loud, ok SDIing as well

Ramblings on the paradigm-shift that is TDI.

Friday, July 1, 2011

Higher Availability

A question oft asked: How do I make my TDI solution highly available? The answer often boils down to what 'available' means to your solution. In many cases it means that one or more AssemblyLines continue to function. This can be 'wired into' a solution with surprisingly little effort.

To start with, don't create long-running ALs. In other words, don't set the timeout parameter for your Iterator to 'never time out'. Instead, let it wait a bit for new input (e.g. new changelog entries or messages on a queue) and then report End-of-Data so the AL stops. And then restart it.

So when doing a directory sync, you let the Change Detection Connector 'listen' for changes and stop if none appear in, say, 10 minutes. Then you have some other process relaunch the Sync AL: a cron job, a Windows Scheduled Task, or even another AL.
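To make the 'stop and relaunch' idea concrete, here is a minimal sketch of the relauncher as a small Python script instead of a cronjob. The function name relaunch_forever and its parameters are made up for illustration, and the command you pass in is whatever starts your Sync AL on your installation; nothing here is a TDI API.

```python
import subprocess
import time


def relaunch_forever(command, pause_seconds=5.0, max_runs=None):
    """Run `command` to completion, then start it again.

    `command` is a hypothetical placeholder for whatever launches the AL.
    Returns the list of exit codes, one per run. `max_runs` exists only so
    the loop can be bounded (e.g. for testing); None means loop forever.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # Blocks until the process ends, e.g. because the Iterator timed
        # out and the AL reported End-of-Data.
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        runs += 1
        if max_runs is None or runs < max_runs:
            # Short pause so a crashing AL does not spin in a tight loop.
            time.sleep(pause_seconds)
    return exit_codes
```

A scheduler like cron does the same job with less code; the loop is mostly useful when you also want to look at each exit code before relaunching.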

This is the simplest form of HA design, and it also gives you an opportunity to check status and send reports/alerts if needed whenever the Sync AL stops - simply by checking the error object in the AL's After Close Epilog Hook. Since you expect the AL to stop every once in a while, you can also check that it actually does - for example, using the TDI commandline utility, tdisrvctl, or even another AL. If you detect that the AL is hanging, you can stop and restart it. Furthermore, if an unhandled error occurs and the Sync AL stops abnormally, it simply gets restarted as well.

The idea of monitoring an AL and starting another if it stops seems straightforward. But it is not really that easy. Just because an AL appears to have stopped does not mean it's a good idea to launch a backup. Unless you design for this, running multiple copies of the same AL simultaneously (e.g. reading changelog) may not be a good idea. Also, it is very hard to determine where the failure lies: did the AL stop, has the connection to the TDI Server API been lost, did the TDI Server or JVM die, was there a network failure, or did the server hardware crash? It might be that the AL is waiting for a lock on some connected system or resource, or working to re-establish a lost connection. Starting a second copy may serve no purpose.
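One way to 'design for this' is to make the launcher refuse to start a second copy while the first is still around. Below is a rough sketch using a lock file; the class names and the lock file path are hypothetical, and it only guards launchers on the same machine that agree to use the same path.

```python
import os


class AlreadyRunning(Exception):
    """Raised when the lock file shows another copy is active."""


class SingleInstanceLock:
    """Context manager that holds an exclusive lock file while running."""

    def __init__(self, path):
        self.path = path  # hypothetical location, e.g. next to the config
        self._fd = None

    def __enter__(self):
        try:
            # O_EXCL makes creation atomic: it fails if the file exists.
            self._fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise AlreadyRunning(self.path)
        # Record our process id so a human can see who holds the lock.
        os.write(self._fd, str(os.getpid()).encode())
        return self

    def __exit__(self, exc_type, exc, tb):
        os.close(self._fd)
        os.remove(self.path)
        self._fd = None
        return False  # never swallow exceptions from the guarded block
```

Wrap the launch in "with SingleInstanceLock(path):" and treat AlreadyRunning as 'skip this round'. Note the weakness: if the launcher itself is killed hard, the lock file is left behind and has to be cleaned up, which is exactly the 'where does the failure lie' problem again.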

From experience, the most common situation is that either the AL is hanging - which could be due to an error in script logic, or I/O latency of connected systems - or it has stopped due to an unexpected (and unhandled) exception. If the AL is actually hanging, then it can be killed using API calls or via the tdisrvctl utility, and then restarted. If it has stopped abruptly, restarting is the answer again. If you use a cronjob or scheduled task to (re)launch the AL, then you can be sure the TDI Server is started fresh each time as well.
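The hang-or-crash distinction can be sketched like this, assuming the AL is launched as its own operating system process so that killing the process stands in for the API or tdisrvctl route. The function name, the deadline parameter and the reading of exit code 0 as 'stopped normally' are all assumptions for the example.

```python
import subprocess


def run_with_deadline(command, deadline_seconds):
    """Run `command` once and classify how it ended.

    Returns ("ok", 0), ("failed", exit_code) or ("hung", None).
    Assumption: exit code 0 means the AL stopped normally.
    """
    process = subprocess.Popen(command)
    try:
        code = process.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # Still running well past the expected stop time: treat as hanging.
        process.kill()
        process.wait()  # reap the killed process
        return ("hung", None)
    return ("ok" if code == 0 else "failed", code)
```

Pick a deadline comfortably longer than the Iterator timeout (say 15 minutes for a 10 minute timeout) so a healthy run is never mistaken for a hang. Whatever the outcome, the next step is the same: send an alert if it was not "ok", then relaunch.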

Of course, you can use a message queue and divide up your solution into 'feed' and 'handler' AssemblyLines, allowing you to run multiple copies to increase both performance and availability. You can also use Server mode Connectors to drive the solution, since Server mode provides features for pooling and reuse/restart of concurrent AssemblyLines.
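For readers who have not used the feed/handler split, here is the shape of it in plain Python. An in-process queue and threads stand in for a real message queue and separate handler ALs, and every name in the sketch is invented for the example.

```python
import queue
import threading


def run_feed_and_handlers(items, handle, handler_count=2):
    """Feed `items` onto a queue and let several handlers drain it.

    Returns the handler results in completion order (not input order).
    """
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a handler to finish

    def handler():
        while True:
            item = work.get()
            if item is stop:
                break
            outcome = handle(item)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=handler) for _ in range(handler_count)]
    for thread in threads:
        thread.start()
    for item in items:  # the 'feed' side
        work.put(item)
    for _ in threads:  # one sentinel per handler
        work.put(stop)
    for thread in threads:
        thread.join()
    return results
```

The point of the split is that the feed only has to get changes onto the queue; handlers can be added, stopped and restarted without losing what is already queued, provided the real queue is persistent.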

These and other techniques and reflections on building robust solutions have been captured by TDI architect Johan Varno and can be found here: http://www.redbooks.ibm.com/redpieces/pdfs/redp4672.pdf

Often the simplest answer is the best one: expect ALs to stop and then restart them again.