[oak perl] Comparing two files
M. Lewis
cajun at cajuninc.com
Sun May 29 16:16:15 PDT 2005
Thanks, Mark and Michael, for the thoughts and ideas. I'm not terribly
concerned about efficiency, as the script isn't going to get used much;
perhaps only a few times.
Reading the files once and putting them into arrays sounds like the most
efficient method, though. I'll give that some thought.
Thanks again,
Mike
Mark Bole wrote:
> Ditto on the tradeoffs. For relatively small files, just slurp them
> both into arrays and use any of several well-documented techniques for
> comparing them, as suggested.
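[Editor's note: a minimal sketch of the slurp-and-compare idea above. The two arrays stand in for slurped file contents (e.g. my @long = <$fh>;) and the sample data is purely illustrative; a hash lookup makes one pass over each file rather than rescanning.]

```perl
use strict;
use warnings;

# @short and @long stand in for the slurped file contents.
my @short = ("apple\n", "banana\n", "cherry\n");
my @long  = ("apple\n", "cherry\n", "date\n");

# Build a lookup hash from the long file's lines, then test each
# line of the short file against it -- one pass over each array.
my %in_long = map { $_ => 1 } @long;
my @differences = grep { !$in_long{$_} } @short;

print @differences;    # lines in @short that are not in @long
```

With real files, the arrays would come from open/readline (my @long = <$long_fh>;) and @differences would be printed to the $differences output file.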
>
> Pre-sorting each file (array) will eliminate the need to re-seek. This
> is what you do to use the Unix command "comm", which makes it simpler
> (but less powerful) than "diff" (or "fc" under Windows).
>
> A more general approach is to think of each file as a relational
> database table, and figure out what the primary (unique) key is for each
> row (line). If there is a unique string in each line that is easy to
> extract, great, otherwise something like an MD5 hash value (Digest::MD5)
> for each line can be generated once for the large file (use it as the
> key for a Perl hash) and then compared against the key (MD5 hash) for
> each line of the other file.
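[Editor's note: a sketch of the Digest::MD5 keying idea above, assuming no convenient unique string exists in each line. The sample arrays stand in for slurped file contents; Digest::MD5 is a core Perl module.]

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# @big and @small stand in for the slurped file contents.
my @big   = ("first record\n", "second record\n");
my @small = ("second record\n", "third record\n");

# One pass over the large file: store a digest of each line as a
# hash key, rather than the (possibly long) line itself.
my %key_seen;
$key_seen{ md5_hex($_) } = 1 for @big;

# Then compare the digest of each line of the other file.
my @not_in_big = grep { !$key_seen{ md5_hex($_) } } @small;
print @not_in_big;
```

The tradeoff is the cost of hashing every line versus the memory saved by keying on a fixed-size digest instead of the full line.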
>
> In short, for the cost of some pre-processing (sorting and/or key
> extraction), you shouldn't have to go through each file more than once.
>
> With Unix you could also try something like this from the command line
> (no Perl)
>
> grep -v -F -f file1 file2
>
> but I imagine it would choke on files over a certain size, or else take
> a very long time. (on most Unixes, 'grep -F' and 'fgrep' are synonymous).
>
> --Mark Bole
>
> Michael Paoli wrote:
>
>>A few items to consider.
>>There are lots of ways to compare and look at differences among
>>files - most notably beyond determining if the entire data contents are
>>identical or not. That's really a topic unto itself. The source to diff(1)
>>might be a useful/interesting place to start looking at that, and/or
>>suitable information on various algorithms.
>>
>>If the size of the files is relatively small compared to the virtual
>>memory available, it may be most/quite efficient to have perl read each
>>of the entire files into arrays, and one can then handle, compare, etc.
>>that data as desired, without need to reread the files.
>>
>>As for repositioning in a file, take a look at the seek perl function, and
>>other related perl functions. If the files are quite large relative to
>>the virtual memory available, this may be a preferable approach. The
>>operating system may also help significantly with caching, so some/many
>>logical rereads may not require physical rereading of on-disk data.
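[Editor's note: a sketch of rewinding a filehandle with seek, as suggested above, which answers the original "reset the pointer back to line 1" question without closing and reopening. A scratch file is created here only so the example is self-contained.]

```perl
use strict;
use warnings;

# Create a small scratch file for the demonstration.
my $tmp = 'seek_demo.txt';
open my $out, '>', $tmp or die "Can't write $tmp: $!";
print $out "line 1\nline 2\nline 3\n";
close $out;

open my $fh, '<', $tmp or die "Can't read $tmp: $!";
1 while <$fh>;            # read through to end of file

seek $fh, 0, 0;           # whence 0 (SEEK_SET): rewind to byte 0
my $first = <$fh>;        # back at line 1 without reopening
close $fh;
unlink $tmp;

print $first;             # "line 1\n"
```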
>>
>>I'd guesstimate the more efficient approaches probably avoid rereading the
>>files, or portions thereof ... but then there are always the tradeoffs
>>between machine efficiency, programmer efficiency, and time, and for
>>sufficiently small problem tasks, optimization may not be a significant
>>factor.
>>
>>Quoting "M. Lewis" <cajun at cajuninc.com>:
>>
>>
>>
>>>my $shortfile;
>>>my $longfile;
>>>my $differences;
>>>
>>>
>>>I'm writing a script to compare two text files ($shortfile & $longfile).
>>>If a line appears in $shortfile, but that line is not in $longfile, then
>>>I want to write that line out to $differences
>>>
>>>I'm relatively certain it is not efficient to open $longfile for each
>>>entry in $shortfile. Both files are of the magnitude of 800+ lines.
>>>
>>>For example, a given line in $shortfile is found at line 333 in
>>>$longfile. Without closing and reopening $longfile, I don't know how to
>>>reset the 'pointer' in $longfile back to line 1.
>>>
>>>Perhaps there is a better way of doing this. I hope I've explained what
>>>I'm trying to do clearly.
>>>
>>>Suggestions ?
>>>
>>>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Oakland mailing list
> Oakland at pm.org
> http://mail.pm.org/mailman/listinfo/oakland
--
A computer program does what you tell it to do, not what you want it
to do.
18:14:01 up 4 days, 31 min, 5 users, load average: 0.25, 0.08, 0.02
Linux Registered User #241685 http://counter.li.org