[Nottingham-pm] Monitoring website uptimes
jim.a.driscoll at gmail.com
Tue Jul 29 12:33:18 PDT 2014
> On 29 Jul 2014, at 19:45, Duncan Fyfe <duncanfyfe at domenlas.com> wrote:
>> On 29/07/14 18:35, James Green wrote:
>> Hey folks,
>> Rather than my usual meeting-arranging blather on this list ... I have
>> an actual Perl-related question! OK, it's not very Perl related.
>> Following a bunch of recent conversations about the future of
>> search.cpan.org, and the fact it was seemingly down all the time, I've
>> started gathering stats on when both it, and metacpan.org, are down.
>> Unfortunately I'm getting a lot of what I suspect are false positives.
>> I'm using LWP::UserAgent to get() a specific search page from each
>> site, timing out after 30s, and if it hasn't loaded, considering it
>> "down" until the next check. This process runs every 2 minutes, from
>> cron. Quite often a site will fail to load just once, then be back up
>> the next time -- which is as likely to be a transient routing problem
>> at my end as an issue at theirs.
>> Does anyone have experience monitoring the availability of websites,
>> or exciting ideas for better approaches to this data?
> Quick check, details below, but for starters it looks like there might
> be a reverse DNS problem with metacpan.org. I'll have a more detailed
> look later.
It's probably just a misconfiguration on one of the servers, and unlikely to have anything to do with reverse DNS at all; a single bad server behind the name would certainly explain why it only breaks "sometimes". For monitoring purposes, the IP address and port that were actually connected to should be available as properties of the HTTP::Response (peeraddr/peerport maybe?), so log those on both success and failure to identify whether there is a bad server.
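A rough sketch of that logging, untested against either site. Rather than a peeraddr/peerport method, LWP records the remote "address:port" of the connection in a Client-Peer response header, so that is what gets logged here. The sub names format_log_line and check are made up for illustration. When no connection could be made at all the header is absent, and the sketch logs "-" instead.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build one log line: UTC timestamp, URL, status line, peer and elapsed time.
sub format_log_line {
    my ($epoch, $url, $status, $peer, $elapsed) = @_;
    my $stamp = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
    return sprintf '%s %s status="%s" peer=%s elapsed=%.2fs',
        $stamp, $url, $status, $peer // '-', $elapsed;
}

# Fetch $url once and describe the outcome, including which server answered.
sub check {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded only when a check is actually run
    require Time::HiRes;
    my $ua    = LWP::UserAgent->new(timeout => 30);
    my $start = Time::HiRes::time();
    my $res   = $ua->get($url);
    my $took  = Time::HiRes::time() - $start;
    # "scalar" matters: in list context a missing header returns an empty
    # list, which would shift the remaining arguments along by one.
    my $peer = scalar $res->header('Client-Peer');
    return format_log_line(time(), $url, $res->status_line, $peer, $took);
}

# URLs to check are taken from the command line.
print check($_), "\n" for @ARGV;
```

Run it from cron with the URLs as arguments and append the output to a log file; a bad server should then show up as failures clustered on one peer address.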
I think you can persuade LWP::UserAgent to connect to a specific IP address via its proxy support, so looping over every IP address the hostname resolves to on each test cycle would be both viable and useful.
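A minimal sketch of that loop, with several assumptions. It only handles plain http URLs, since LWP treats https through a proxy differently. It only sees IPv4 addresses, because it uses gethostbyname. It also assumes the origin servers accept the absolute-URI request line that a proxy-style request sends. The sub names ipv4_addresses and check_via are hypothetical.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# Return every IPv4 address the name resolves to (empty list on failure).
sub ipv4_addresses {
    my ($host) = @_;
    my (undef, undef, undef, undef, @packed) = gethostbyname($host);
    return map { inet_ntoa($_) } @packed;
}

# Fetch $url, forcing the TCP connection to $ip:$port by naming that
# address as the HTTP proxy. The Host header still carries the site name.
sub check_via {
    my ($url, $ip, $port) = @_;
    require LWP::UserAgent;    # loaded only when a check is actually run
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->proxy(http => "http://$ip:$port/");
    my $res = $ua->get($url);
    return sprintf '%s via %s:%d -> %s', $url, $ip, $port, $res->status_line;
}

# URLs to check are taken from the command line; each one is tried
# against every address its hostname currently maps to.
for my $url (@ARGV) {
    require URI;
    my $uri = URI->new($url);
    print check_via($url, $_, $uri->port), "\n"
        for ipv4_addresses($uri->host);
}
```

Creating a fresh user agent per address keeps one server's proxy setting from leaking into the next check, at the cost of not reusing connections.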
To rule out local connectivity problems, also monitor a control (one or more unrelated sites) at the same time: if the control is down too, then the problem is probably your own connection.
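The control check could be folded into each cycle's verdict along these lines. This is only a sketch: the sub name classify and the labels 'up', 'down', 'local' and 'unknown' are invented for illustration, and treating "every control failed" as the sign of a local problem is one reading of the rule above.

```perl
use strict;
use warnings;

# Classify one monitoring cycle.
#   $target_ok  - true if the monitored site loaded
#   @control_ok - one true/false value per control site checked alongside it
sub classify {
    my ($target_ok, @control_ok) = @_;
    return 'up'      if $target_ok;
    return 'unknown' if !@control_ok;       # no control: cannot tell
    my $controls_up = grep { $_ } @control_ok;
    return 'local'   if $controls_up == 0;  # nothing reachable: probably us
    return 'down';                          # controls fine, target is not
}
```

Cycles classified as 'local' can then be left out of the downtime stats rather than counted against the site.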