[Nottingham-pm] Monitoring website uptimes

James Green jkg at earth.li
Wed Jul 30 14:47:15 PDT 2014


Hi all, thanks for the responses!

On 30 July 2014 16:56, Duncan Fyfe <duncanfyfe at domenlas.com> wrote:
> On 30/07/14 15:11, Duncan Fyfe wrote:
>> On 30/07/14 00:30, Duncan Fyfe wrote:

>>> Back to your test script.  How frequent are the failures or, put
>>> another way, how many times would you expect to have to run it
>>> before you saw a failure?

Interestingly, there haven't been any failures at all in the last 24
hours. Looking back further, I'm averaging 3-4 failures per day (out
of 720 attempts) for MetaCPAN, and closer to 1 for CPANsearch, almost
never at the same time as one another (although if whichever I check
first -- which I seem to recall is randomised -- times out, the other
test starts up to 30 seconds later, so any local issue could have
cleared up by then).

That makes me wonder if I should do the checks asynchronously and kick
them all off as close together as possible. Hmm.
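
Something like this, perhaps -- a rough fork-based sketch, where the
URLs and the 30-second timeout are placeholders rather than whatever
the real script uses:

    use strict;
    use warnings;
    use HTTP::Tiny;

    # Fork one checker per site so the requests start within
    # milliseconds of each other, instead of the second check only
    # firing after the first has had a chance to time out.
    my @sites = ('https://metacpan.org/', 'http://search.cpan.org/');

    my @pids;
    for my $url (@sites) {
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # child: run one check and report
            my $res = HTTP::Tiny->new( timeout => 30 )->get($url);
            printf "%s %s (%s)\n", $url,
                $res->{success} ? 'OK' : 'FAIL', $res->{status};
            exit($res->{success} ? 0 : 1);
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;    # parent: wait for all the checks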

>> Ok. I've modified the script (git patch attached, do as you will with it):

[some elided]

>>       c) to dump more HTTP information in the event of a failure;

I considered that, but decided I didn't want data I'd have to process
manually. I guess you're right, though -- I should capture it
somewhere, at least, for later reference...
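
Maybe straight into the database next to the result, something like
this (the table and column names are guesses on my part -- I haven't
checked what your patch actually uses):

    use strict;
    use warnings;
    use DBI;
    use HTTP::Tiny;

    my $url = 'https://metacpan.org/';
    my $dbh = DBI->connect('dbi:SQLite:dbname=uptime.db', '', '',
                           { RaiseError => 1 });
    my $res = HTTP::Tiny->new( timeout => 30 )->get($url);

    unless ($res->{success}) {
        # Flatten the headers; repeated headers come back as arrayrefs
        my $headers = join "\n", map {
            my $v = $res->{headers}{$_};
            $v = join ', ', @$v if ref $v eq 'ARRAY';
            "$_: $v";
        } sort keys %{ $res->{headers} };

        # Stash the raw response alongside the failed result
        $dbh->do(
            'INSERT INTO dumps (ts, url, status, headers, body)
             VALUES (?, ?, ?, ?, ?)',
            undef, time, $url, $res->{status}, $headers, $res->{content},
        );
    }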

> Couple more patches as promised.  The first just removes an unnecessary
> dependency.

Whoops. Leaving Data::Dumper lying around in production is a bad habit
of mine...

> The second adds new tables (results_2 and dumps_2) which
> have an added hostid column so we can merge results.  It also
> adds a quick bash script, with the necessary SQL, to copy data from
> the results to results_2 table.
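
If I've read that right, the copy itself boils down to something like
the following (via DBI rather than your bash script, and guessing at
the column names until I've actually applied the patch):

    use strict;
    use warnings;
    use DBI;

    my $hostid = 1;    # assumed: one fixed id per testing host
    my $dbh = DBI->connect('dbi:SQLite:dbname=uptime.db', '', '',
                           { RaiseError => 1 });

    # Copy every existing row across, tagging it with this host's id
    $dbh->do(
        'INSERT INTO results_2 (hostid, ts, url, status)
         SELECT ?, ts, url, status FROM results',
        undef, $hostid,
    );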

I'll take a closer look at this (probably at the weekend), but it
sounds like a good idea -- more data will smooth out any minor
anomalies, and testing from more locations will rule out local
issues. I had been trying to think of a (fair, sane) way to "smooth"
the data by just dropping things that were clearly not real problems
-- perhaps re-testing more frequently after a failure and ignoring it
if the recovery was quick (sketched below) -- but yours is probably
the more sensible approach.
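
For what it's worth, the re-test idea would look roughly like this --
the retry count and delay are arbitrary placeholders:

    use strict;
    use warnings;
    use HTTP::Tiny;

    # On a failure, retry a few times at short intervals; only treat
    # it as a real outage if the site stays down throughout.
    sub check_with_retries {
        my ($url, $retries, $delay) = @_;
        my $ua = HTTP::Tiny->new( timeout => 30 );
        for my $attempt (0 .. $retries) {
            my $res = $ua->get($url);
            return { ok => 1, attempts => $attempt + 1 }
                if $res->{success};
            sleep $delay if $attempt < $retries;
        }
        return { ok => 0, attempts => $retries + 1 };
    }

    my $r = check_with_retries('https://metacpan.org/', 3, 5);
    print $r->{ok}
        ? "up (took $r->{attempts} attempt(s))\n"
        : "down after $r->{attempts} attempts\n";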

Thanks again,

James


More information about the Nottingham-pm mailing list