rapidscaleclusters.com

The "Mon" Monitoring System for Linux

When building a cluster, one question is how to implement health checking and monitoring. There are so many fine choices...

A partial list would include Ganglia, Nagios, Mon, Keepalived, or writing something new from scratch. In the end, I picked Mon because it seemed so flexible... It could be bent to my will and made to do anything my heart desired! :-) So here's a Mon howto based on what I learned.

It's a great piece of software. The way it works is simple:
1. There is a "monitor" which runs some kind of test and exits with code 0 or 1.
2. There is an "alert" which gets called when the exit code is 1.
3. There is a config file that specifies which monitors to run, how often, which alerts get called for each, and which nodes they apply to.

The best part is that pretty much everything is written in scripting languages, and you can write your own alerts, monitors, whatever, in any language you want. They just have to exit 0 (successful) or 1 (failed). And the package comes with dozens of monitors and alerts for checking common things like HTTP services, SMTP, etc.
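To give an idea of how small a custom monitor can be, here is a minimal sketch in plain sh. The file name http-check.monitor and the function names probe and check_hosts are made up for this example and are not part of the Mon package. It assumes that Mon appends the members of the hostgroup to the monitor's command line and treats the first line of output as the failure summary.

```shell
#!/bin/sh
# http-check.monitor: hypothetical example, not shipped with Mon.
# Assumption: Mon passes the hosts of the hostgroup as arguments.

# Return 0 if the host answers an HTTP request on port 80.
# curl -f turns HTTP error statuses (400 and up) into a non-zero exit code.
probe() {
    curl -s -f -o /dev/null --max-time 5 "http://$1/"
}

# Probe every host given as an argument. Print the failed hosts on one
# line (the summary) and return 1 if any of them failed, 0 otherwise.
check_hosts() {
    failed=""
    for host in "$@"; do
        probe "$host" || failed="$failed $host"
    done
    if [ -n "$failed" ]; then
        echo "${failed# }"   # strip the leading space
        return 1
    fi
    return 0
}

# Exit 1 on failure; falling off the end of the script exits 0.
check_hosts "$@" || exit 1
```

You can try a script like this by hand before wiring it into Mon: run it with a host as its argument and check the exit code with `echo $?`.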

Mon doesn't have all the bells and whistles of Nagios, but it's rock solid, simple, and very clean. It was the perfect tool for the job: checking on a few different ports and having a highly customizable response. It was almost too easy to write the bash scripts I wanted called when a Mon monitor failed a test. Even better, Mon comes with alerts that will take the error and email it to you.

Here's a snippet showing the relevant part of a mon.cf config file to make Mon tick:

hostgroup http_nodes 10.0.0.35

watch http_nodes
    service http_failures
        description Checks port 80 for http failures
        interval 60s   ## Run every 60 seconds
        randskew 5s    ## Add or subtract a random skew of up to 5 seconds on each interval
        monitor http.monitor    ## This script connects to port 80 and exits 0 if the http response code is OK
        period hr {12am-11pm}   ## When alerts may fire. This means all the time: 12:00:00 am through 11:59:59 pm
            alert apache-restart.alert   ## This can be a simple script like:  '/etc/init.d/httpd restart'
            alert mail.alert <your-email-address>   ## Email me and tell me about the failure
            numalerts 1   ## Alert only once. Subsequent failures are ignored until the monitor passes again, then this gets reset
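
For the curious, here is a rough sketch of what a restart alert like apache-restart.alert could look like. It is not the script I actually use. It assumes Mon calls alerts with -s (service), -g (group) and -h (hosts) options plus some others that are ignored here, and that the first line on stdin is the monitor's summary. RESTART_CMD, LOG_FILE and handle_alert are names invented for this sketch.

```shell
#!/bin/sh
# apache-restart.alert: sketch only, under the assumptions stated above.
# RESTART_CMD and LOG_FILE are hypothetical settings, overridable from the environment.
RESTART_CMD="${RESTART_CMD:-/etc/init.d/httpd restart}"
LOG_FILE="${LOG_FILE:-/var/log/apache-restart-alert.log}"

handle_alert() {
    service="" group="" hosts=""
    # Pick out -s, -g and -h with their values; skip anything else.
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -s) service="${2:-}"; shift ;;
            -g) group="${2:-}"; shift ;;
            -h) hosts="${2:-}"; shift ;;
        esac
        [ "$#" -gt 0 ] && shift
    done

    # First line of stdin is assumed to be the monitor's summary.
    read -r summary || true

    # Leave a trace of what happened, then restart the web server.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $group/$service failed on [$hosts]: $summary" >> "$LOG_FILE"
    $RESTART_CMD   # unquoted on purpose so the command and its argument are split
}

# Mon always passes options, so only act when there are arguments.
if [ "$#" -gt 0 ]; then
    handle_alert "$@"
fi
```

Because numalerts is 1 in the config above, a script like this would run once per outage rather than restarting Apache every 60 seconds.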

Pretty cool little program! Its flexibility is key. In minutes, you can make it do anything.
 

Copyright 2010    RapidScale Clusters, LLC