TDIing out loud

Ramblings on the paradigm-shift that is TDI.

Wednesday, September 2, 2009

Clustering, HA and Scaling, Oh My!

OK, here come some ramblings on making better TDI solutions. Risk reduction is always a matter of expense versus exposure, and this approach requires an upfront investment and adds lots of moving parts. So it's not one-size-fits-all: it pays the most dividends for larger projects, or for a series of similar smaller ones. That said...

You start by deconstructing a solution into a set of individual service AssemblyLines:
  1. Error Handler - First consider that all error hooks/scripts in all other ALs dispatch to a central error queue. Then you have this Error Handler AL that continuously iterates this queue and does logging/alerting/reacting as needed. This AL also emits a periodic heartbeat to an event queue*.

    * That's right, 2 queues. And note that you will want persistent queueing like MQ, the System Store/DB, files, etc.

  2. Event Handlers (plural) - This set of service ALs catches events - e.g. detecting changes; listening for incoming messages, mail or requests; polling for new/changed files, and so forth. Each AL pushes the events it catches to the event queue, along with its own heartbeats. As mentioned above, all errors are dispatched to the error queue.

  3. Workers (plural) - Each of these grabs the next relevant event off the queue and then performs the required action, like writing detected changes to a single target (e.g. for a sync solution you would have one Worker AL to write to Domino, one for AD, one for SAP, etc.), passing events to other systems (i.e. switching), performing searches and building responses (rss/REST/SOAP/...) or whatever. Each also reports status through the event queue and problems to the error queue.

  4. Heartbeat Monitor - Polls the queues to make sure things are happening, for example that events are being processed in a timely fashion, and that heartbeats are received (and cleaned up). If the Error Handler is down, it does its own alerting and logging (it can even send events to AMC or a backup TDI Server).
This approach is suited for unit testing and provides better solution availability and maintainability than one-stop-shopping ALs tend to do. It also scales easily - just run more ALs, using inter-AL comms for coordination.
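To make the four roles above a bit more concrete, here is a minimal sketch in Python rather than in actual AssemblyLines. Everything in it is an assumption made for illustration: the in-memory `queue.Queue` objects stand in for the persistent queues (MQ, System Store, files), and the function names, message fields and the `write` callback are all hypothetical. Each function does one pass, roughly what one cycle of the corresponding AL might do.

```python
import queue
import time

# In-memory stand-ins for the two persistent queues (assumption: a real
# solution would use MQ, the System Store/DB, files, etc.).
event_queue = queue.Queue()
error_queue = queue.Queue()


def heartbeat(source):
    """Build a heartbeat message destined for the event queue."""
    return {"type": "heartbeat", "source": source, "ts": time.time()}


def event_handler(source, changes):
    """Feed role: push every caught change to the event queue, then a heartbeat."""
    for change in changes:
        event_queue.put({"type": "change", "source": source,
                         "target": change["target"], "data": change["data"]})
    event_queue.put(heartbeat(source))


def run_worker(name, target, write):
    """Worker role: handle pending events for one target, put the rest back."""
    skipped, handled = [], 0
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            break
        if event["type"] != "change" or event["target"] != target:
            skipped.append(event)  # not ours: leave it for another AL
            continue
        try:
            write(event["data"])
            handled += 1
        except Exception as exc:  # every failure goes to the error queue
            error_queue.put({"source": name, "error": str(exc), "event": event})
    for event in skipped:
        event_queue.put(event)
    event_queue.put(heartbeat(name))
    return handled


def run_error_handler(log):
    """Error Handler role: drain the error queue into a log, then emit a heartbeat."""
    count = 0
    while not error_queue.empty():
        err = error_queue.get_nowait()
        log.append(f"{err['source']}: {err['error']}")
        count += 1
    event_queue.put(heartbeat("ErrorHandler"))
    return count


def run_heartbeat_monitor(expected, max_age, now=None):
    """Monitor role: consume heartbeats and return sources that are missing or stale."""
    now = time.time() if now is None else now
    latest, kept = {}, []
    while True:
        try:
            msg = event_queue.get_nowait()
        except queue.Empty:
            break
        if msg["type"] == "heartbeat":
            latest[msg["source"]] = max(msg["ts"], latest.get(msg["source"], 0))
        else:
            kept.append(msg)  # real events stay on the queue
    for msg in kept:
        event_queue.put(msg)
    return sorted(s for s in expected if now - latest.get(s, 0) > max_age)
```

Note that this monitor keeps no memory between passes, so it only sees heartbeats that arrived since its last run; a real one would probably remember the last heartbeat per AL somewhere persistent.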

But it's definitely not for the fainthearted, or those not comfortable in the AL Debugger. However, if you do it right, you end up with a reusable AL service framework.

And if you send it to me for publication then I'll send you a limited edition, orange plastic Metamerge pen :)

These really are great pens :)


Anonymous said...

It is very interesting for me to read the post. Thanks for it. I like such topics and everything that is connected to them. I definitely want to read more on that blog soon.

Trace said...

Hey Eddie,

I'm resurrecting a very old post, but I've been putting a lot of thought into creating a version of the HA framework you described.

What is the reasoning behind separating the error/logging queue from the regular event queue? The Error Handler AL seems to me to be another worker AL (with maybe a few more logic branches).

Since the workers you describe support a single target, the event queue needs to differentiate requests based on which worker can perform the supported operation/event. If you have your event types already separated, why not have the Error Handler read messages from the main queue?

For worker heartbeats, what iterator would you have used? Long lived workers I imagine would need to emit a heartbeat periodically or by request of the monitor. Using the system queue connector as an iterator and then reposting the heartbeat request to the queue until all workers had responded resulted in some...interesting results.

Thanks for all the ideas/examples/experience you've shared!

Eddie Hartman said...

@Trace Excellent questions. Please don't take this rant too literally on all counts. The important piece is the separation of duties between those ALs that feed events into the solution and those that handle them. The Feed ALs should be as short and simple as possible, to ensure that these events - changes, messages, etc. - get picked up and persisted someplace (queue, db, files, ...). It is the Worker ALs, which handle these 'service requests' (i.e. requests for some kind of service to be performed), that necessarily have more sources of problems: multiple system connections that can fail, complex business logic, and externalized logic, maps and settings that can be invalid.

So, step one is to envision a framework with this division of duties. Whether or not all errors are passed to an Error Queue/DB and dealt with by a dedicated AL will depend on the size of your project. In particular, if it spans multiple Configs then a set of central services can reduce duplication in your code.

Another thing that I have been adding to recent projects is a REST interface. As of TDI 7.1.1 FP4 and SDI 7.2 FP2 there is a built-in REST service in the TDI server that lets you invoke an AL via a URI like this in a browser:


Note that both the Config name and the AL name are case sensitive. The referenced AL is executed, and any data found in the Work Entry at the end of execution is turned into a JSON payload and returned to the browser. So it is simple to add REST calls like /status or /metrics that return information on how well (and fast) the solution is performing. Of course, you have to capture these metrics and store them someplace shared between ALs. If multiple TDI servers are involved, then a database (like a shared System Store) can be used.
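As a rough illustration of the "capture and share" part, here is a small Python sketch. It is not TDI code: `sqlite3` stands in for a shared database or System Store, and the table name, column names and functions (`open_store`, `record`, `status_payload`) are all made up for this example. The idea is simply that every AL adds to its own row, and a status AL turns the table into the JSON it returns.

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Open the shared metrics table (sqlite3 is a stand-in for a shared DB)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS al_metrics ("
        "al_name TEXT PRIMARY KEY, processed INTEGER, errors INTEGER, "
        "busy_seconds REAL, last_seen REAL)"
    )
    return db


def record(db, al_name, processed=0, errors=0, busy_seconds=0.0, now=None):
    """Add one AL's latest counts to its running totals and stamp the time."""
    now = time.time() if now is None else now
    # Make sure the row exists, then accumulate into it.
    db.execute("INSERT OR IGNORE INTO al_metrics VALUES (?, 0, 0, 0.0, 0.0)",
               (al_name,))
    db.execute(
        "UPDATE al_metrics SET processed = processed + ?, errors = errors + ?, "
        "busy_seconds = busy_seconds + ?, last_seen = ? WHERE al_name = ?",
        (processed, errors, busy_seconds, now, al_name),
    )
    db.commit()


def status_payload(db):
    """Build the JSON a /status-style call could return, keyed by AL name."""
    rows = db.execute(
        "SELECT al_name, processed, errors, busy_seconds, last_seen "
        "FROM al_metrics ORDER BY al_name"
    ).fetchall()
    status = {}
    for name, processed, errors, busy, last_seen in rows:
        rate = processed / busy if busy > 0 else 0.0
        status[name] = {
            "processed": processed,
            "errors": errors,
            "events_per_second": round(rate, 2),
            "last_seen": last_seen,
        }
    return json.dumps(status, sort_keys=True)
```

A Worker would call something like `record(db, "ADWorker", processed=10, busy_seconds=2.0)` at the end of each cycle, which covers both "how well" (errors) and "how fast" (events per second of busy time).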

As to the heartbeat AL, if you provide a REST call to query the status of solution ALs, then this can either be called from a browser, or from another TDI AL designed to periodically take the pulse of the solution. One cat, many skinning options...
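For the second option, a pulse-taker might look something like the Python sketch below. The assumptions are mine: that the status call returns a JSON object keyed by AL name with a `last_seen` timestamp (a hypothetical shape), and that the caller supplies the URL, since none is given here. The function names and the `alert` callback are likewise just illustrative.

```python
import json
import time
from urllib.request import urlopen


def fetch_status(url, timeout=5):
    """GET the status document; the URL is supplied by the caller."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def take_pulse(status, max_age, now=None):
    """Mark each AL 'ok' or 'stale' based on the age of its last_seen stamp."""
    now = time.time() if now is None else now
    report = {}
    for name, info in status.items():
        age = now - info.get("last_seen", 0)
        report[name] = "ok" if age <= max_age else "stale"
    return report


def poll_forever(url, max_age, interval, alert):
    """Periodically take the pulse and alert on anything that is not 'ok'."""
    while True:
        try:
            report = take_pulse(fetch_status(url), max_age)
        except (OSError, ValueError) as exc:
            # Network failure or unparseable payload: the status call itself is down.
            alert(f"status call failed: {exc}")
        else:
            for name, state in report.items():
                if state != "ok":
                    alert(f"{name} is {state}")
        time.sleep(interval)
```

Keeping `take_pulse` separate from the fetching means the same check works whether the status comes over REST or straight from the shared store.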

And you have me ranting anew, Trace :)