[sf-perl] Server downtime reporting and recovery

Matt Barkovich barko192 at gmail.com
Tue Feb 24 17:13:57 PST 2009


Hi all,

I was curious about how those of you who work with web apps deal with
minimizing downtime when a particular service dies for whatever
reason.  I'm not a sysadmin by training; rather, it is a responsibility
that no one else seemed willing to take.  Right now I have a Perl
script that runs as a cron job every five minutes, checking the status
of the various services on the server and restarting and reporting if
anything is amiss.

I've been told that my production schedule needs to be pushed forward
and five minutes of downtime will soon be unacceptable.  Since I've
got a .NET app running in Mono (which has not been kind to me), I need
to catch problems as quickly as possible and restart the service.
Most frequently the Mono app will just hang indefinitely rather than
crash outright.  With the new schedule I don't have time to fix (read:
replace) the problematic app before I go live.
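
For illustration only, here is a minimal sketch of a watchdog that
probes more often than a five-minute cron job and treats a hang the
same as a crash, by putting a timeout on the probe.  It is not the
script described above, and it is in Python only for the sake of the
example.  The health URL, restart command, 15-second interval and
5-second timeout are all hypothetical placeholders, and it assumes the
app answers HTTP when it is healthy.

```python
import subprocess
import time
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if `url` answers with a 2xx status before the timeout.

    The timeout applies to each blocking socket operation, so an app that
    accepts the connection but never responds is reported as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        # Refused connection, timeout (hung app), HTTP error, bad URL, etc.
        return False


def watch(url, restart_cmd, interval=15.0, timeout=5.0, max_checks=None,
          probe=is_healthy, run=subprocess.call, sleep=time.sleep):
    """Probe `url` every `interval` seconds; run `restart_cmd` on failure.

    `max_checks=None` loops forever. `probe`, `run` and `sleep` are
    injectable so the loop can be exercised without a real service.
    Returns the number of restarts that were triggered.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(url, timeout):
            run(restart_cmd)  # hypothetical restart command (list of args)
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts


# Example call (placeholder URL and command; runs until interrupted):
# watch("http://localhost:8080/", ["/etc/init.d/myapp", "restart"])
```

A loop like this would run as a long-lived process instead of a cron
entry, so something else still has to keep the watchdog itself alive.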

So my question: what do you folks recommend as far as checking the
status of services more frequently than every 5 minutes?  Would you
recommend sticking with Perl, or is there some FOSS that would
better serve my purposes?  In my research, I've found programs like
Nagios, but don't know much about them.  I'd prefer not to add too much
in the way of overhead, but I also don't want to reinvent the wheel.

Sorry if this is a little off topic.

Thanks,

Matt

