[pm-h] complex data structure help

Paul Archer tigger at io.com
Tue Mar 28 09:31:25 PST 2006


I guess the problem I'm having is that I need to consolidate information. 
Since this is an NFS log, each line represents a file read or written. 
That's too much information (hundreds of MBs a day). I need to be able to
distill it to just summary information. I'm just not sure how to handle 
that. I figure that the smallest unit I'd have is what one user on one 
machine read or wrote on one filesystem during an hour.

Maybe a simple format:

filesystem user client-machine time-to-the-hour read written

Then for every line, I check to see if I have an entry that matches the 
first four parameters. If so, I add the number of bytes read or written. If 
not, I create a new entry. Then I can sort by whatever field I want, and 
limit my searches however I need to.
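
In code, I'm picturing something like this sketch. The parse_line() routine
is my guess at the xferlog field layout (and a naive whitespace split will
break on filenames with spaces), so the field positions need checking
against a real nfslog:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my %summary;

  while (my $line = <>) {
      my ($fs, $user, $client, $hour, $dir, $bytes) = parse_line($line)
          or next;
      my $key = join "\0", $fs, $user, $client, $hour;
      $summary{$key}{$dir} += $bytes;      # consolidate as we go
  }

  # One possible report: sort by bytes written, descending.
  for my $key (sort { ($summary{$b}{written} || 0) <=>
                      ($summary{$a}{written} || 0) } keys %summary) {
      my ($fs, $user, $client, $hour) = split /\0/, $key;
      printf "%-20s %-8s %-12s %s  read=%d written=%d\n",
          $fs, $user, $client, $hour,
          $summary{$key}{read} || 0, $summary{$key}{written} || 0;
  }

  # Assumed wu-ftpd xferlog field order:
  # day mon dd hh:mm:ss yyyy xfer-time host bytes file type flag dir mode user ...
  sub parse_line {
      my @f = split ' ', shift;
      return unless @f >= 14;
      my ($hh) = $f[3] =~ /^(\d+):/    or return;
      my ($fs) = $f[8] =~ m{^(/[^/]+)} or return;  # top-level dir as "filesystem"
      my $hour = "$f[1] $f[2] $f[4] $hh:00";
      my $dir  = $f[11] eq 'o' ? 'read' : 'written';  # o = outgoing = a read
      return ($fs, $f[13], $f[6], $hour, $dir, $f[7]);
  }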

Only that seems inefficient. Could I normalize that somehow?

Paul


Yesterday, G. Wade Johnson wrote:

> I'm not familiar with that log format, but taking the database suggestion
> one step in an odd direction, DBD::CSV might be able to put a relational
> database front-end on a log file.
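>
> Untested sketch; the csv_tables mapping and how much SQL SQL::Statement
> actually supports vary by version, so check the DBD::CSV docs (the file
> and column names here are made up):
>
>   use DBI;
>
>   my $dbh = DBI->connect("dbi:CSV:f_dir=/var/log/nfs", undef, undef,
>                          { RaiseError => 1 });
>   # Map a table name onto the summary file.
>   $dbh->{csv_tables}{summary} = { file => 'summary.csv' };
>
>   my $sth = $dbh->prepare(
>       "SELECT username, written FROM summary ORDER BY written DESC");
>   $sth->execute;
>   while (my ($user, $bytes) = $sth->fetchrow_array) {
>       print "$user wrote $bytes bytes\n";
>   }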
>
> In the past, my normal approach for this sort of thing was an array of
> hashes (one hash per line). The array is easily sorted using 'sort' and can
> be filtered using 'grep'.
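>
> For example (field names invented for illustration):
>
>   # one hash per log line:
>   # { fs => '/export/home', user => 'jdoe', client => 'wks01',
>   #   hour => '2006-03-27 09', read => 0, written => 4096 }
>
>   # jdoe's records, biggest writes first:
>   my @jdoe = sort { $b->{written} <=> $a->{written} }
>              grep { $_->{user} eq 'jdoe' } @records;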
>
> Depending on how big the data set is and how complicated the queries are, a
> database might be a better choice.
>
> G. Wade
>
> On Mon, 27 Mar 2006 17:41:39 -0600
> buu at erxz.com wrote:
>
>> On Mon, Mar 27, 2006 at 05:13:02PM -0600, Paul Archer wrote:
>>> I'm writing a log analyzer (a la Webalizer) to analyze Solaris' nfslog
>>> files. They're in the same format as wu-ftpd xferlog files. I'd use an
>>> existing solution, but I can't find anything that keeps track of reads vs.
>>> writes, which is critical for us.
>>> Anyway, I need to be able to sort by filesystem, client machine, user,
>>> time (with a one-hour base period), read, write, or total usage.
>>> Can anyone suggest a data structure (or pointers to same) that will allow
>>> me to pull data out in an arbitrary fashion (i.e. users on X day sorted by
>>> data written)?
>>> Once I have the structure, I can deal with doing the reports, but I want
>>> to make sure I don't shoot myself in the foot with the structure.
>>>
>>> I was thinking of a hash of hashes, where the keys are filesystems
>>> pointing to hashes where the keys are client machines, etc, etc. But it
>>> seems that approach would be inefficient for lookups based on times or
>>> users (for example).
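>>>
>>> Something like (made-up names):
>>>
>>>   $data{$fs}{$client}{$user}{$hour}{read}    += $bytes_read;
>>>   $data{$fs}{$client}{$user}{$hour}{written} += $bytes_written;
>>>
>>> That makes "everything on one filesystem" cheap, but "everything jdoe
>>> wrote on Monday" means walking every filesystem and every client just
>>> to reach the user keys.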
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Paul
>>
>> Um. Have you considered a relational database? Sounds ideal for your
>> problem.
>
>
> -- 
> No, no, you're not thinking, you're just being logical.
>                                                       -- Niels Bohr
>



-----------------------------------------------------
"Somebody did say Swedish porn, there--
but someone always does..."
--Clive Anderson, host of "Whose Line Is It, Anyway",
after asking the audience for movie suggestions
-----------------------------------------------------

