[pm-h] complex data structure help
Paul Archer
tigger at io.com
Tue Mar 28 09:31:25 PST 2006
I guess the problem I'm having is that I need to consolidate information.
Since this is an NFS log, each line represents a file read or written.
That's too much information (hundreds of MBs a day). I need to be able to
distill it to just summary information. I'm just not sure how to handle
that. I figure that the smallest unit I'd have is what one user on one
machine read or wrote on one filesystem during an hour.
Maybe a simple format:
    filesystem  user  client-machine  time-to-the-hour  read  written
Then for every line, I check to see if I have an entry that matches the
first four parameters. If so, I add the number of bytes read or written. If
not, I create a new entry. Then I can sort by whatever field I want, and
limit my searches however I need to.
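Roughly what I'm picturing, as a sketch (the field layout in the split is
made up until I pin down the real nfslog parsing):

    use strict;
    use warnings;

    # Key: filesystem/user/client/hour -> running byte totals.
    my %totals;

    while (my $line = <>) {
        chomp $line;
        # Placeholder field order -- the real xferlog parse goes here.
        my ($fs, $user, $client, $stamp, $dir, $bytes) = split ' ', $line;
        my $hour = substr $stamp, 0, 13;      # e.g. "2006-03-28T09"
        my $key  = join "\0", $fs, $user, $client, $hour;
        $totals{$key}{$dir} += $bytes;        # $dir is 'read' or 'written'
    }

    # One flat key makes "sort by whatever field" a matter of splitting
    # the key back apart, e.g. top consumers by bytes written:
    for my $key (sort { ($totals{$b}{written} || 0)
                        <=> ($totals{$a}{written} || 0) } keys %totals) {
        my ($fs, $user, $client, $hour) = split /\0/, $key;
        printf "%s %s %s %s r=%d w=%d\n", $fs, $user, $client, $hour,
            $totals{$key}{read} || 0, $totals{$key}{written} || 0;
    }

(The hash autovivifies, so the check-then-create step happens for free.)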
Only that seems inefficient. Could I normalize that somehow?
Paul
Yesterday, G. Wade Johnson wrote:
> I'm not familiar with that log format, but taking the database suggestion
> one step in an odd direction, DBD::CSV might be able to put a relational
> database front-end on a log file.
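> Something along these lines, untested -- the file name and column layout
> are placeholders:
>
>     use DBI;
>
>     my $dbh = DBI->connect("DBI:CSV:f_dir=/var/log/nfs", undef, undef,
>                            { RaiseError => 1 });
>     $dbh->{csv_tables}{xfer} = {
>         file      => "summary.csv",
>         col_names => [qw(filesystem user client hour
>                          bytes_read bytes_written)],
>     };
>
>     my $sth = $dbh->prepare(q{
>         SELECT user, client, bytes_written FROM xfer
>         WHERE filesystem = '/export/home'
>         ORDER BY bytes_written DESC
>     });
>     $sth->execute;
>     while (my @row = $sth->fetchrow_array) {
>         print "@row\n";
>     }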
>
> In the past, my normal approach for this sort of thing was an array of hashes
> (one hash per line). The array is easily sorted with 'sort' and can be
> filtered using 'grep'.
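> For example (record fields invented for the sketch):
>
>     my @records = (
>         { user => 'alice', client => 'ws1', hour => '09', bytes => 4096 },
>         { user => 'bob',   client => 'ws2', hour => '10', bytes => 1024 },
>     );
>
>     # Sort by bytes, largest first:
>     my @by_bytes = sort { $b->{bytes} <=> $a->{bytes} } @records;
>
>     # Filter down to one user:
>     my @alice = grep { $_->{user} eq 'alice' } @records;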
>
> Depending on how big the data set is and how complicated the queries are, a
> database might be a better choice.
>
> G. Wade
>
> On Mon, 27 Mar 2006 17:41:39 -0600
> buu at erxz.com wrote:
>
>> On Mon, Mar 27, 2006 at 05:13:02PM -0600, Paul Archer wrote:
>>> I'm writing a log analyzer (a la Webalizer) to analyze Solaris' nfslog
>>> files. They're in the same format as wu-ftpd xferlog files. I'd use an
>>> existing solution, but I can't find anything that keeps track of reads vs
>>> writes, which is critical for us.
>>> Anyway, I need to be able to sort by filesystem, client machine, user,
>>> time (with a one-hour base period), read, write, or total usage.
>>> Can anyone suggest a data structure (or pointers to same) that will allow
>>> me to pull data out in an arbitrary fashion (e.g. users on X day sorted by
>>> data written)?
>>> Once I have the structure, I can deal with doing the reports, but I want
>>> to make sure I don't shoot myself in the foot with the structure.
>>>
>>> I was thinking of a hash of hashes, where the keys are filesystems
>>> pointing to hashes where the keys are client machines, etc, etc. But it
>>> seems that approach would be inefficient for lookups based on times or
>>> users (for example).
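>>> Roughly this shape (sketch):
>>>
>>>     my %usage;
>>>     # $usage{$fs}{$client}{$user}{$hour} = bytes transferred
>>>     $usage{'/export/home'}{'ws1'}{'alice'}{'09'} = 4096;
>>>
>>>     # ...where a per-user total means walking every outer level:
>>>     my %per_user;
>>>     for my $fs (keys %usage) {
>>>         for my $client (keys %{ $usage{$fs} }) {
>>>             for my $user (keys %{ $usage{$fs}{$client} }) {
>>>                 $per_user{$user} += $_
>>>                     for values %{ $usage{$fs}{$client}{$user} };
>>>             }
>>>         }
>>>     }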
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Paul
>>
>> Um. Have you considered a relational database? Sounds ideal for your
>> problem.
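>> A sketch of the idea with DBD::SQLite, say (schema and names invented):
>>
>>     use DBI;
>>
>>     my $dbh = DBI->connect("dbi:SQLite:dbname=nfs.db", "", "",
>>                            { RaiseError => 1 });
>>     $dbh->do(q{
>>         CREATE TABLE xfer (
>>             filesystem TEXT, user TEXT, client TEXT, hour TEXT,
>>             bytes_read INTEGER, bytes_written INTEGER
>>         )
>>     });
>>
>>     # Arbitrary slices become one query each, e.g. users on a given
>>     # day ranked by data written:
>>     my $rows = $dbh->selectall_arrayref(q{
>>         SELECT user, SUM(bytes_written) AS w FROM xfer
>>         WHERE hour LIKE '2006-03-27%'
>>         GROUP BY user ORDER BY w DESC
>>     });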
>
>
> --
> No, no, you're not thinking, you're just being logical.
> -- Niels Bohr
>
-----------------------------------------------------
"Somebody did say Swedish porn, there--
but someone always does..."
--Clive Anderson, host of "Whose Line Is It Anyway?",
after asking the audience for movie suggestions
-----------------------------------------------------