Heartbeat Monitors
Add a heartbeat monitor to make sure you keep receiving data.
Monitoring the status of a system is useful, but what happens when the monitoring system stops receiving data altogether? MonaLisa has no built-in way to ask "When did I last hear from the headnode?", but a heartbeat monitor in the alerts system keeps track of precisely this. The format of a heartbeat filter is:
<heartbeat farm="My Farm" cluster="My Cluster" node="My Node" param="My Param" timeout="300">
<!-- list actions to take here -->
</heartbeat>
If the timeout value - specified in seconds - is not given, then it defaults to 300. The best place to use the heartbeat is to check and see if your MonaLisa service is still alive from another MonaLisa service; in Nagios terminology, peering. For example, on our osg-test2 host, we have the following Alert.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<filters>
<heartbeat farm="red.unl.edu" cluster="MonaLisa" node="localhost-gone" param="Load5" timeout="40">
<print> Cluster headnode is down! </print>
<email>
<from>alerts@osg-test2.unl.edu</from>
<to>some-address@your-site.edu</to>
<subject>$FARM is down.</subject>
<text>This is an automated MonaLisa alert. To filter it out, simply filter any messages containing "FILTER_MONALISA".
$FARM has stopped sending heartbeat messages. Please check.
</text>
</email>
</heartbeat>
</filters>
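This filter prints a message and sends an email once 40 seconds pass without a Load5 value from red.unl.edu, which lets us know that our headnode has gone down.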
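For peering in the opposite direction, a similar filter could run on the other MonaLisa service and watch osg-test2. The following is only a sketch built from the elements already used on this page: the farm, cluster, node, and param values and both email addresses are placeholders we have assumed, not values taken from a real configuration, so replace them with whatever your service actually publishes. The timeout attribute is left out to show the 300-second default.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Sketch only: farm, cluster, node, and param below are assumed
     placeholders; use the names your own service reports. -->
<filters>
<!-- No timeout attribute, so the 300-second default applies. -->
<heartbeat farm="osg-test2.unl.edu" cluster="MonaLisa" node="localhost" param="Load5">
<print> osg-test2 MonaLisa service is down! </print>
<email>
<!-- Both addresses are hypothetical. -->
<from>alerts@your-site.edu</from>
<to>some-address@your-site.edu</to>
<subject>$FARM is down.</subject>
<text>This is an automated MonaLisa alert. To filter it out, simply filter any messages containing "FILTER_MONALISA".
$FARM has stopped sending heartbeat messages. Please check.
</text>
</email>
</heartbeat>
</filters>
```

With a filter like this on each side, either service can report when the other goes quiet, so losing one host does not also silence the alert about it.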