[Nottingham-pm] Monitoring website uptimes

Wed Jul 30 15:39:10 PDT 2014

> 
> Interestingly there haven't been any failures at all in the last 24
> hours -- looking back further, I'm averaging 3-4 per day (of 720
> attempts) for MetaCPAN and closer to 1 for CPANsearch, almost never at
> the same time as one another (although if whichever I check first --
> which I seem to recall is randomised -- times out, the other test
> starts up to 30 seconds later, so any local issue could have cleared
> up)
> 
> That makes me wonder if I should do the checks asynchronously and kick
> them all off as close together as possible. Hmm.

Makes me wonder if MetaCPAN has a low maximum number of concurrent
connections and your failed connections just happen to hit a busy
period (such as when lots of mirrors are resyncing).  A concurrency
problem might be down to limits on the webserver or backend DB.
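
A rough sketch of the "kick them all off as close together as possible"
idea, written in Python purely for illustration (it is not the script
under discussion). The names fetch_status and check_all are made up for
this example, and the 30 second timeout is only assumed from the delay
mentioned above. One thread per URL is used so that a timeout on one
site cannot delay the start of the check on the other.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_status(url, timeout=30):
    """Return (ok, detail, seconds) for a single GET request.

    Hypothetical helper: 'ok' is True only for an HTTP 200 response.
    """
    started = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as response:
            ok, detail = response.status == 200, str(response.status)
    except Exception as exc:  # HTTP errors, timeouts, DNS failures, ...
        ok, detail = False, repr(exc)
    return ok, detail, time.monotonic() - started


def check_all(urls, fetch=fetch_status):
    """Start every check at (nearly) the same moment, one thread per URL."""
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        # Submit everything first, then collect, so no check waits on another.
        futures = {url: pool.submit(fetch, url) for url in urls}
        return {url: future.result() for url, future in futures.items()}
```

Calling check_all() with the two site URLs returns a dict keyed by URL,
so a failure on one site can be compared directly with the result the
other site gave at the same moment.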

> 
>>> Ok. I've modified the script (git patch attached, do as you will with it):
> 
> [some elided]
> 
>>>       c) to dump more HTTP information in the event of a failure;
> 
> I considered that; but I decided I didn't want data I'd have to
> manually process. I guess you're right though -- I should capture it
> somewhere, at least, for later reference...

I just want more data from at least one failure. After a few failures
the dump code can be disabled.

> 
>> Couple more patches as promised.  The first just removes an unnecessary
>> dependency.
> 
> Whoops. Leaving Data::Dumper lying around in production is a bad habit
> of mine...

It was me this time.  I have to admit though, I have used Data::Dumper
as part of the logging and exception handling in production code[1] and
would not be afraid to do so again. It is a really powerful tool for
debugging subtle problems, but like any powerful tool, there are places
you can safely use it and times you should have known better.

[1] https://github.com/DuncanFyfe/application-toolkit-perl Msg.pm and
Exception.pm classes.

> 
>> The second adds new tables (results_2 and dumps_2) which
>> have an added hostid column so we can merge results.  It also
>> adds a quick bash script, with the necessary SQL, to copy data from
>> the results to results_2 table.
> 
> I'll take a closer look at this (probably at the weekend) but it
> sounds like a good idea -- more data will hide any minor anomalies,
> and testing from more locations will rule out any local issues. I had
> been trying to think of a (fair, sane) way to "smooth" the data by
> just dropping things that were clearly not real problems -- perhaps
> re-testing more frequently after a failure and ignoring it if the
> recovery was quick -- but this is probably a more sensible approach.
> 

I wouldn't invest too much more time in the script.  We can easily
filter the data as necessary into appropriate subsets.  The most
interesting data will be that from multiple machines close to a failure
on one machine.
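
As an illustration of that kind of filtering, here is a minimal sketch
using Python's sqlite3 module. The database engine and every column
name other than hostid are assumptions made for the example (the real
results_2 layout is not shown in this thread), and the 120 second
window is arbitrary. The query pairs each failure on one machine with
results for the same site from other machines close to it in time.

```python
import sqlite3

# Hypothetical layout: only the hostid column is known from the thread.
SCHEMA = """
CREATE TABLE results_2 (
    hostid     TEXT    NOT NULL,  -- which machine ran the check
    site       TEXT    NOT NULL,  -- e.g. 'metacpan'
    checked_at INTEGER NOT NULL,  -- Unix epoch seconds
    ok         INTEGER NOT NULL   -- 1 = success, 0 = failure
);
"""

# For each failure f, list results o from *other* hosts within the window.
NEAR_FAILURE = """
SELECT f.hostid, f.checked_at, o.hostid, o.checked_at, o.ok
FROM results_2 AS f
JOIN results_2 AS o
  ON o.site = f.site
 AND o.hostid <> f.hostid
 AND ABS(o.checked_at - f.checked_at) <= :window
WHERE f.ok = 0 AND f.site = :site
ORDER BY f.checked_at, o.checked_at;
"""


def results_near_failures(conn, site, window=120):
    """Return (failed_host, failed_at, other_host, other_at, other_ok) rows."""
    params = {"site": site, "window": window}
    return conn.execute(NEAR_FAILURE, params).fetchall()
```

If the other machines report success within the window, the failure is
more likely to be local to the machine that saw it. If they fail too,
the problem is more likely to be at the server end.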

The script as is should reveal symptoms, e.g. whether the observed
failures are specific to a machine or to particular times (e.g.
coinciding with multiple mirrors syncing), but I fear it will be
difficult to get a definitive cause without access to the webserver log
files.
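
One simple way to look for those two symptoms is to count failures per
machine and per hour of day. The sketch below is in Python for
illustration only. The row layout (hostid, epoch seconds, ok flag) and
the function name failure_profile are assumptions, and hours are taken
in UTC so that results from machines in different time zones line up.

```python
from collections import Counter
from datetime import datetime, timezone


def failure_profile(rows):
    """Count failures by host and by UTC hour of day.

    rows: iterable of (hostid, epoch_seconds, ok) tuples (assumed layout).
    """
    by_host, by_hour = Counter(), Counter()
    for hostid, epoch, ok in rows:
        if ok:
            continue  # only failures are of interest here
        by_host[hostid] += 1
        hour = datetime.fromtimestamp(epoch, tz=timezone.utc).hour
        by_hour[hour] += 1
    return by_host, by_hour
```

Failures concentrated on one host point at that machine or its network.
Failures spread across hosts but clustered in the same hours would fit
the busy-period theory. Neither pattern proves a cause on its own.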

Have fun,
Duncan