[sf-perl] Server downtime reporting and recovery

Michael Friedman friedman at highwire.stanford.edu
Tue Feb 24 20:53:50 PST 2009

I'd also like to recommend Nagios. My group uses it to monitor  
everything and, except for when you're setting up a new kind of  
monitor, it "just works". We even use it to call out to perl scripts  
to monitor certain special things.

The trick with any monitoring software is to monitor the right things
as often as you need to without hurting those things. So, sure, you
can monitor the website every 30 seconds if you want, but you really
need each hit to have as small an impact as possible or you'll cause
your own performance problems.

For example:
- if a static page GET will work, don't GET a CGI page
- if a CGI page will work, then don't GET a DB-driven page
- if you can monitor the causes of failures, that's better than
monitoring the failures themselves
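
To make the "lightest hit that works" idea concrete, here is a minimal sketch of a check in the style of a Nagios plugin, written in Python 3. It is not the script we run. The URL, the 5 second timeout and the 1 second warning threshold are all assumptions for illustration. The only fixed convention is the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL. The timeout matters because a hung service never answers, and the check should report that as a failure instead of hanging too.

```python
import http.client
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(status, elapsed, warn_seconds=1.0):
    """Map an HTTP status and a response time to a Nagios exit code."""
    if status != 200:
        return CRITICAL
    if elapsed > warn_seconds:
        return WARNING
    return OK


def check_url(url, timeout=5.0):
    """Fetch url once. No answer within timeout counts as CRITICAL."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP error statuses
        # and malformed responses.
        return CRITICAL, time.monotonic() - start
    elapsed = time.monotonic() - start
    return classify(status, elapsed), elapsed


def report(url):
    """Return the exit code and the one-line message a plugin would print."""
    code, elapsed = check_url(url)
    return code, f"{LABELS[code]} - {url} answered in {elapsed:.2f}s"
```

To use it as a plugin, a small wrapper would call report() on the cheapest page that still proves the site is up (a static file, say), print the message, and exit with the returned code.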

One great thing about Nagios for us is that it's easy to set up new  
monitors. So the first time we had a machine run out of disk space  
(oops!) we set up disk space monitors on all the servers. They email  
someone when the server hits 80% full and page someone at 90%. We have  
CPU monitors that make sure that any process that is taking too much  
time gets killed and restarted. We watch home pages, access-controlled
pages, and Java servlet-served pages. (Rather than tie up a real
servlet, which would take more resources, we made a "Hello, World"
servlet. If the servlet container is running, it responds. If it
isn't, it won't. It's the lightest thing we could hit and still learn
anything about the Java servlet status. Watch for caching, though: a
cached response can make a dead container look alive.)
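
The disk space thresholds above (email at 80%, page at 90%) map neatly onto the WARNING and CRITICAL states of a Nagios-style check. A minimal sketch in Python 3, assuming the notification side (who gets email, who gets paged) is configured in the monitoring system and not in the script; the function names are made up for illustration:

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def percent_used(path="/"):
    """Percentage of the filesystem holding path that is in use.

    Computed as used / total, so it may differ slightly from what
    df reports.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_state(percent, warn=80.0, crit=90.0):
    """Map a usage percentage onto a Nagios exit code."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK
```

For example, disk_state(percent_used("/var")) gives the state for the filesystem holding /var. Keeping the thresholds as parameters makes it easy to use different limits on different servers.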

Anyway, Nagios is pretty easy to administer and really flexible. Once  
you get used to it, you'll think of all sorts of things you want to
monitor.

-- Mike

PS - One last thing. We discovered that having a monitor restart  
processes was fine, as far as it goes, but not enough. Now we monitor  
the number of restarts that happen. If the automated monitor restarts  
the service three times in a row, then another monitor pages someone  
-- since there's obviously something that's stopping the service from  
coming back up.
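
The restart counting described in this PS can be sketched in a few lines of Python 3. This is only an illustration of the logic, not our setup: the class name is hypothetical, and it assumes the limit of three applies to restarts in a row, with a single healthy check resetting the count.

```python
class RestartEscalator:
    """Count consecutive automated restarts and decide when to page."""

    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive_restarts = 0

    def record_check(self, service_up):
        """Return (restart, page) flags for one monitoring cycle."""
        if service_up:
            # A healthy check breaks the streak.
            self.consecutive_restarts = 0
            return (False, False)
        self.consecutive_restarts += 1
        # Always attempt the restart; page once the streak hits the limit.
        return (True, self.consecutive_restarts >= self.limit)
```

Whether to keep restarting after someone has been paged is a judgment call. This sketch keeps trying, on the theory that a late recovery is still better than none.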
Mike Friedman | HighWire Press, Stanford Univ | friedman at highwire.stanford.edu

On Feb 24, 2009, at 5:19 PM, Mason Jones wrote:

> I'd say yes, you're going to want to be looking at something like  
> Nagios or Big Brother, which can check the status of known
> services/ports/web apps/etc. as frequently as you need, and then invoke
> scripts to restart things. You can actually get quite a bit done  
> with home-grown perl scripts, really, but there are plugins and  
> other things available for tools like Nagios which you'll probably  
> find save you time (once you learn the system, of course).
> On Tue, Feb 24, 2009 at 5:13 PM, Matt Barkovich <barko192 at gmail.com>  
> wrote:
> Hi all,
> I was curious about how those of you who work with web apps deal with
> minimizing downtime when a particular service dies for whatever
> reason.  I'm not a sysadmin by training, rather it is a responsibility
> that no one else seemed willing to take.  Right now I have a perl
> script that runs as a cron job every five minutes, checking the status
> of the various services on the server and restarting and reporting if
> anything is amiss.
> I've been told that my production schedule needs to be pushed forward
> and five minutes of downtime will soon be unacceptable.  Since I've
> got a .NET app running in mono (which has not been kind to me) I need
> to catch problems as quickly as possible and restart the service.
> Most frequently the mono app will just hang indefinitely, not crash
> outright.  With the new schedule I don't have time to fix (read
> replace) the problematic app before I go live.
> So my question, what do you folks recommend as far as checking the
> status of services more frequently than every 5 minutes?  Would you
> recommend sticking with perl, or is there some FOSS that would
> better serve my purposes?  In my research, I've found programs like
> Nagios, but don't know much about them. I'd prefer not to add too much
> in the way of overhead, but I also don't want to reinvent the wheel.
> Sorry if this is a little off topic.
> Thanks,
> Matt
> _______________________________________________
> SanFrancisco-pm mailing list
> SanFrancisco-pm at pm.org
> http://mail.pm.org/mailman/listinfo/sanfrancisco-pm