[pm-h] [PBML] complex data structure help

Kevin Shaum kevin at shaum.com
Tue Mar 28 01:59:32 PST 2006


On Monday 27 March 2006 5:13 pm, Paul Archer wrote:
> I'm writing a log analyzer (a la Webalyzer) to analyze Solaris' nfslog
> files. They're in the same format as wu-ftpd xferlog files. I'd use an
> existing solution, but I can't find anything that keeps track of reads vs
> writes, which is critical for us.
> Anyway, I need to be able to sort by filesystem, client machine, user, time
> (with a one-hour base period), read, write, or total usage.
> Can anyone suggest a data structure (or pointers to same) that will allow
> me to pull data out in an arbitrary fashion (i.e., users on X day sorted by
> data written)?
> Once I have the structure, I can deal with doing the reports, but I want to
> make sure I don't shoot myself in the foot with the structure.
>
> I was thinking of a hash of hashes, where the keys are filesystems pointing
> to hashes where the keys are client machines, etc, etc. But it seems that
> approach would be inefficient for lookups based on times or users (for
> example).

The simplest approach would be to store it all as one flat list of 
(references to) lists, then grep and sort that big list as each query 
requires.

# records are array refs: [ hostname, username, time, ... ]
@result = sort { $a->[1] cmp $b->[1] }                      # by username
          grep { $_->[2] >= $time0 and $_->[2] < $time1 }   # in the time window
          grep { $_->[0] eq 'myhost' }                      # one client host
          @dataset;

(Note the comparison is 'cmp', not 'lt' -- a sort block has to return 
-1/0/1, not a boolean.)
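
For concreteness, here's a minimal sketch of building @dataset as such array
refs from an xferlog-style stream. The field offsets are taken from the
wu-ftpd xferlog layout (the first five whitespace-separated tokens are the
timestamp; remote-host, file-size, direction, and username follow at fixed
positions), so check them against your actual nfslog output -- and note that
a plain split breaks on filenames containing spaces:

#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;    # core module; converts broken-down time to epoch seconds

my %mon = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
           Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

my @dataset;
while (my $line = <>) {
    my @f = split ' ', $line;
    next unless @f >= 14;                 # skip malformed lines
    # f[0..4] hold the timestamp, e.g. "Mon Mar 27 17:13:45 2006"
    my ($h, $m, $s) = split /:/, $f[3];
    my $time = timelocal($s, $m, $h, $f[2], $mon{$f[1]}, $f[4]);
    # record layout: [ hostname, username, time, bytes, direction ]
    # (xferlog direction is 'o' for outgoing/read, 'i' for incoming/write)
    push @dataset, [ $f[6], $f[13], $time, $f[7], $f[11] ];
}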

A more readable (but possibly less efficient) version would store each entry 
in the big list as (a reference to) a hash:

@result = sort { $a->{username} cmp $b->{username} }
          grep { $_->{time} >= $time0 and $_->{time} < $time1 }
          grep { $_->{hostname} eq 'myhost' }
          @dataset;
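
Since you want hourly read/write totals anyway, another option is to
aggregate as you scan instead of grepping afterwards. A minimal sketch,
assuming each record is the hash form above plus 'bytes' and an
xferlog-style 'direction' field ('o' for reads, 'i' for writes):

use Time::Local;

my %usage;    # $usage{$user}{$hour}{read|write} = total bytes
for my $rec (@dataset) {
    my $hour = $rec->{time} - $rec->{time} % 3600;    # truncate to the hour
    my $dir  = $rec->{direction} eq 'o' ? 'read' : 'write';
    $usage{ $rec->{username} }{$hour}{$dir} += $rec->{bytes};
}

# "Users on day X sorted by data written": sum the day's 24 hour-buckets.
my $day = timelocal(0, 0, 0, 27, 2, 2006);    # midnight of the day in question
my %written;
for my $user (keys %usage) {
    $written{$user} += $usage{$user}{ $day + $_ * 3600 }{write} || 0 for 0 .. 23;
}
my @report = sort { $written{$b} <=> $written{$a} } keys %written;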

If the data set is large enough that this isn't practical, then the suggestion 
to move to a relational database (e.g., SQLite) makes sense. But it sounds like 
you're planning to keep it all in RAM anyway.
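
If you do end up going that way, the in-memory query above maps almost
directly onto SQL. A minimal sketch with DBI and DBD::SQLite -- the table and
column names here are just placeholders mirroring the hash records above:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=nfslog.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One row per transfer, same fields as the hash records.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS transfers (
    hostname  TEXT,
    username  TEXT,
    time      INTEGER,   -- epoch seconds
    bytes     INTEGER,
    direction TEXT       -- 'o' = read, 'i' = write
)
SQL

# The same filter-and-sort as the in-memory version:
my ($time0, $time1) = (0, time);    # window boundaries, as above
my $rows = $dbh->selectall_arrayref(
    'SELECT username, time, bytes FROM transfers
      WHERE hostname = ? AND time >= ? AND time < ?
      ORDER BY username',
    {}, 'myhost', $time0, $time1,
);

Indexes on (hostname, time) and (username, time) would keep those lookups
fast once the log grows.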

Hope this helps.

Kevin

