[oak perl] Comparing two files
Mark Bole
mark at bincomputing.com
Sun May 29 10:59:17 PDT 2005
Ditto on the tradeoffs. For relatively small files, just slurp them
both into arrays and use any of several well-documented techniques for
comparing them, as suggested.
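The slurp-and-compare idea might look something like this sketch (the subroutine name and file handling are mine, not from the thread); it loads the second file's lines into a Perl hash so each lookup is O(1):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: report every line of the first file that never appears in the
# second. Assumes both files fit comfortably in memory.
sub lines_only_in_first {
    my ($file_a, $file_b) = @_;

    open my $fh_b, '<', $file_b or die "Cannot open $file_b: $!";
    my %in_b;
    while (my $line = <$fh_b>) {
        chomp $line;
        $in_b{$line} = 1;    # each line of the second file becomes a hash key
    }
    close $fh_b;

    open my $fh_a, '<', $file_a or die "Cannot open $file_a: $!";
    my @only;
    while (my $line = <$fh_a>) {
        chomp $line;
        push @only, $line unless $in_b{$line};    # O(1) lookup per line
    }
    close $fh_a;
    return @only;
}
```

Each file is read exactly once, which is the whole point.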
Pre-sorting each file (array) will eliminate the need to re-seek. It is
also what you must do before using the Unix command "comm", whose
sorted-input requirement makes it simpler (but less powerful) than
"diff" (or "fc" under Windows).
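For instance, the pre-sort route can be done entirely with standard Unix tools via comm(1); the sample data here just stands in for the real files, whose names are hypothetical:

```shell
# Sketch: lines present in short.txt but absent from long.txt.
printf 'banana\napple\ncherry\n' > short.txt
printf 'apple\ncherry\ndate\n'   > long.txt

# comm(1) requires both inputs to be sorted
sort short.txt > short.sorted
sort long.txt  > long.sorted

# -23 suppresses columns 2 and 3, leaving only lines unique to short.txt
comm -23 short.sorted long.sorted    # prints: banana
```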
A more general approach is to think of each file as a relational
database table, and figure out what the primary (unique) key is for each
row (line). If there is a unique string in each line that is easy to
extract, great, otherwise something like an MD5 hash value (Digest::MD5)
for each line can be generated once for the large file (use it as the
key for a Perl hash) and then compared against the key (MD5 hash) for
each line of the other file.
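A sketch of that keyed-lookup idea, using Digest::MD5 (a core module); storing the 16-byte raw digest as the hash key rather than the full line keeps memory down when lines are long. The subroutine and variable names here are mine:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Sketch: build a digest table for the large file once, then test each
# line of the smaller file against it.
sub lines_missing_from_large {
    my ($small_file, $large_file) = @_;

    my %digest_seen;
    open my $large, '<', $large_file or die "Cannot open $large_file: $!";
    while (my $line = <$large>) {
        chomp $line;
        $digest_seen{ md5($line) } = 1;    # raw MD5 digest as the hash key
    }
    close $large;

    open my $small, '<', $small_file or die "Cannot open $small_file: $!";
    my @missing;
    while (my $line = <$small>) {
        chomp $line;
        push @missing, $line unless $digest_seen{ md5($line) };
    }
    close $small;
    return @missing;
}
```

Note the usual caveat: two different lines could in principle collide on the same digest, though for this kind of job that risk is negligible.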
In short, for the cost of some pre-processing (sorting and/or key
extraction), you shouldn't have to go through each file more than once.
With Unix you could also try something like this from the command line
(no Perl):
grep -v -x -F -f file1 file2
which prints every line of file2 that does not appear, as a complete
line, in file1 (without -x, any file2 line that is merely a substring
of some file1 line would be suppressed too). But I imagine it would
choke on files over a certain size, or else take a very long time. (On
most Unixes, 'grep -F' and 'fgrep' are synonymous.)
--Mark Bole
Michael Paoli wrote:
>A few items to consider.
>There are lots of ways to compare and look at differences among
>files - most notably beyond determining if the entire data contents are
>identical or not. That's really a topic unto itself. The source to diff(1)
>might be a useful/interesting place to start looking at that, and/or
>suitable information on various algorithms.
>
>If the size of the files is relatively small compared to the virtual
>memory available, it may be most/quite efficient to have perl read each
>of the entire files into arrays, and one can then handle, compare, etc.
>that data as desired, without need to reread the files.
>
>As for repositioning in a file, take a look at the seek perl function, and
>other related perl functions. If the files are quite large relative to
>the virtual memory available, this may be a preferable approach. The
>operating system may also help significantly with caching, so some/many
>logical rereads may not require physical rereading of on-disk data.
>
>I'd guesstimate the more efficient approaches probably avoid rereading the
>files, or portions thereof ... but then there are always the tradeoffs
>between machine efficiency, programmer efficiency, and time, and for
>sufficiently small problem tasks, optimization may not be a significant
>factor.
>
>Quoting "M. Lewis" <cajun at cajuninc.com>:
>
>
>
>>my $shortfile;
>>my $longfile;
>>my $differences;
>>
>>
>>I'm writing a script to compare two text files ($shortfile & $longfile).
>>If a line appears in $shortfile, but that line is not in $longfile, then
>>I want to write that line out to $differences
>>
>>I'm relatively certain it is not efficient to open $longfile for each
>>entry in $shortfile. Both files are of the magnitude of 800+ lines.
>>
>>For example, a given line in $shortfile is found at line 333 in
>>$longfile. Without closing and reopening $longfile, I don't know how to
>>reset the 'pointer' in $longfile back to line 1.
>>
>>Perhaps there is a better way of doing this. I hope I've explained what
>>I'm trying to do clearly.
>>
>>Suggestions ?
>>
>>