Operational concerns

KTL client library compatibility

A KTL client library must be compatible with KTL Python in order to be used with keygrabber. The best way to check compatibility for a given service is to invoke KTL Python’s unit test suite against the service in question. Though all of the tests should pass, particular attention should be paid to the ktl.Keyword.callback() and ktl.Keyword.monitor() tests.
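
For a quick manual spot check of those two behaviors (not a substitute for running the full test suite), something along these lines can be used. The service and keyword names are hypothetical, and the ktl.cache(), callback(), and monitor() calls are assumed to follow the usual KTL Python interface:

import threading

import ktl

received = threading.Event()

def on_broadcast(keyword):
    # Fires whenever a broadcast arrives for the monitored keyword.
    received.set()

keyword = ktl.cache('myservice', 'SOMEKEYWORD')    # hypothetical names
keyword.callback(on_broadcast)
keyword.monitor()

# A keyword that broadcasts regularly should trigger the callback quickly.
if received.wait(30):
    print('callback fired: broadcasts are being delivered')
else:
    print('no broadcast within 30 seconds: investigate before relying on keygrabber')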

Keygrabber assumes that subscribing to KTL keyword broadcasts is a low-risk, low-resource operation, one that is unlikely to have a significant operational impact on critical KTL services or otherwise visibly consume computing resources outside the host on which keygrabber is running.

These assumptions are valid for most KTL implementations, including the dtune KTL client library. Services based on the KTL RPC framework, however, have operational failure modes that can be triggered if a client application does not promptly handle broadcasts as they are issued. On this point, KTL Python is consistent and prompt in its handling of KTL broadcasts: each broadcast is immediately put onto an internal queue for subsequent processing by a background thread, so no variable or unpredictable handling occurs in a KTL context, and the processing is effectively invisible to the underlying KTL service.
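
The following Python 3 sketch illustrates the same pattern at the application level: the broadcast callback does nothing but place the event on a queue, and a separate thread performs the slower processing. The service and keyword names are hypothetical, and the ktl.cache(), callback(), and monitor() calls are assumed to follow the usual KTL Python interface:

import queue
import threading
import time

import ktl

events = queue.Queue()    # unbounded, mirroring keygrabber's internal queue

def broadcast_received(keyword):
    # Invoked by KTL Python for each broadcast. Do the bare minimum here and
    # hand the event off to the worker thread immediately; the 'ascii' slice
    # is assumed to hold the most recently broadcast value.
    events.put((time.time(), keyword.name, keyword['ascii']))

def worker():
    while True:
        timestamp, name, value = events.get()
        # Slow processing (database inserts, logging, etc.) belongs here,
        # safely decoupled from the KTL broadcast context.
        events.task_done()

threading.Thread(target=worker, daemon=True).start()

keyword = ktl.cache('myservice', 'SOMEKEYWORD')    # hypothetical names
keyword.callback(broadcast_received)
keyword.monitor()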

This is not to suggest that failure is impossible, as there are other mechanisms that might trigger a backlog of unhandled broadcast events, such as a network or computer failure. If a service has a high volume of large broadcasts, the additional network traffic could create operational problems. These risks should be assessed and taken into account when using keygrabber in a production capacity.

Note

There are some KTL services, admittedly older ones, that do not support KTL broadcasts for all readable keywords. KTL Python will be expanded to include polling support for such keywords, where synchronous read requests will be issued; as of May 2013, that support is not yet in place.
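
Until that support exists, values for such keywords can only be obtained by issuing synchronous reads directly. A minimal caller-side polling sketch, assuming the usual ktl.cache() and read() interface and hypothetical service and keyword names, might look like this:

import time

import ktl

keyword = ktl.cache('oldservice', 'READONLY')    # hypothetical names

previous = None
while True:
    # Issue a synchronous read at a fixed cadence and report changes.
    value = keyword.read()
    if value != previous:
        print(time.strftime('%Y-%m-%d %H:%M:%S'), keyword.name, value)
        previous = value
    time.sleep(10)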

Database maintenance

The performance of PostgreSQL queries will slowly degrade in the absence of regular maintenance, especially when frequent UPDATE or DELETE queries are issued. The vast majority of queries generated by keygrabber are INSERT queries, though the default recording parameters will usually generate some UPDATE queries as well.

This slow degradation is avoided entirely by performing VACUUM operations on a regular basis. Starting with PostgreSQL 8.3, the autovacuum option is enabled by default; in earlier versions, it is strongly recommended that it be enabled explicitly. If autovacuum is not enabled, make sure to perform explicit VACUUM operations regularly.

Depending on the objective, it may also be sensible to periodically purge old records from the database. This is especially true if keygrabber is set to record all events, rather than its default behavior of discarding duplicate and/or high-frequency events. For example, a keygrabber instance may be running to support the decoration of FITS headers with relevant keyword information; for this purpose, there may be no need to retain data more than a week old. Setting the ‘purge’ and ‘expiration’ configuration options will enable this behavior.
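
For installations where autovacuum is unavailable, or where old records are purged by hand rather than through the ‘purge’ and ‘expiration’ options, a periodic maintenance job along these lines could be scheduled. The database name, table name, and column name below are hypothetical and must be adjusted to match the actual keygrabber schema:

import psycopg2

connection = psycopg2.connect(dbname='keygrabber')    # hypothetical database name
connection.autocommit = True    # VACUUM cannot run inside a transaction block
cursor = connection.cursor()

# Discard records more than a week old; 'events' and 'time' are placeholder
# names, not the actual keygrabber schema.
cursor.execute("DELETE FROM events WHERE time < now() - interval '7 days'")

# Reclaim space from dead rows and refresh the query planner's statistics.
cursor.execute("VACUUM ANALYZE")

cursor.close()
connection.close()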

Compression of archival logs

The archival logs occupy some 45 megabytes of disk space per million rows of recorded data. With aggressive LZMA compression, that shrinks to roughly 8 megabytes, a reduction of more than a factor of five. Writing out the archival logs without compressing them is a sure way to prematurely run out of disk space.

The compress_keygrabber_logs script automates the compression of archival log data, which is broken up into discrete weekly files. Invoking the script once a week is sufficient to ensure all log files are compressed appropriately. The sole command-line argument is a directory that will be traversed recursively; every file within that directory tree is assumed to be a keygrabber log file, and will be compressed if it has not been modified in more than seven days. Example invocation:

$KROOT/bin/compress_keygrabber_logs $KROOT/var/keygrabber
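
The following is a rough sketch of the mechanism described above, not the script itself: recurse a directory and compress any file that has not been modified for more than seven days, using Python 3's lzma module at its highest preset. The '.xz' suffix and the choice of module are assumptions; the actual script may differ:

import lzma
import os
import shutil
import sys
import time

cutoff = time.time() - 7 * 24 * 3600    # anything older than seven days

for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
    for filename in filenames:
        path = os.path.join(dirpath, filename)
        if filename.endswith('.xz') or os.path.getmtime(path) > cutoff:
            continue    # already compressed, or modified too recently
        with open(path, 'rb') as plain:
            with lzma.open(path + '.xz', 'wb', preset=9) as packed:
                shutil.copyfileobj(plain, packed)
        os.remove(path)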

Performance warnings

If keygrabber is unable to keep up with the broadcast volume, one or both of the following will be true: messages will be logged indicating that keygrabber “can’t keep up with the broadcast rate” for a given service, and event records will not be immediately available for database retrieval.

If the “can’t keep up” log messages are a regular occurrence in the keygrabber syslog output, this indicates a serious performance problem: the service is broadcasting events at a high enough rate that the keygrabber daemon cannot complete its most basic processing in the synchronous interval allocated for that purpose. If the message occurs only once or twice across a span of ten or fifteen seconds, it indicates that keygrabber fell behind during a burst of broadcasts, but that the rate subsided enough for it to catch up.

After the initial receipt of broadcast events, the next significant bottleneck is committing data to the relational database. The most direct symptom is that regularly broadcast events are not available for prompt retrieval from the database. The next symptom is that the keygrabber process consumes an inordinate amount of resident memory: keygrabber places broadcast events into an unbounded queue for subsequent processing in a background thread, and if the PostgreSQL daemon cannot keep up with the load, most likely due to insufficient disk bandwidth, that queue backs up without limit.
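
To put illustrative (not measured) numbers on this: if a service broadcasts 1,000 events per second but the database layer can only commit 800 per second, the internal queue grows by 200 events per second, roughly 720,000 queued events per hour, and the resident memory of the keygrabber process grows along with it.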

If a keygrabber instance with such a backlog is asked to shut down, it will stop processing new events, but will attempt to fully process its backlog before exiting.