[oak perl] Comparing two files

Mark Bole mark at bincomputing.com
Sun May 29 10:59:17 PDT 2005


Ditto on the tradeoffs.  For relatively small files, just slurp them 
both into arrays and use any of several well-documented techniques for 
comparing them, as suggested.
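
For the specific task quoted below (lines that appear in $shortfile but 
not in $longfile, written to $differences), a minimal slurp-and-compare 
sketch might look like the following; the filenames are placeholders, 
and lines are compared exactly as read:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filenames standing in for $shortfile/$longfile/$differences.
my ($shortfile, $longfile, $differences) =
    ('short.txt', 'long.txt', 'differences.txt');

# Slurp both files into arrays.
open my $sfh, '<', $shortfile or die "Can't read $shortfile: $!";
my @short = <$sfh>;
close $sfh;

open my $lfh, '<', $longfile or die "Can't read $longfile: $!";
my @long = <$lfh>;
close $lfh;

# One well-documented technique: index the long file's lines in a hash,
# then keep only the short file's lines that never appear there.
chomp(@short, @long);
my %in_long = map { $_ => 1 } @long;

open my $dfh, '>', $differences or die "Can't write $differences: $!";
print {$dfh} "$_\n" for grep { !$in_long{$_} } @short;
close $dfh;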

Pre-sorting each file (array) will eliminate the need to re-seek: once 
both inputs are in order, a single forward pass over each is enough.  
This is the same pre-sorting the Unix command "comm" requires, which is 
what makes it simpler (but less powerful) than "diff" (or "fc" under 
Windows).
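
A rough sketch of that sorted scan, again with placeholder filenames: 
once both arrays are sorted, they can be walked in parallel, so each is 
traversed exactly once.

#!/usr/bin/perl
use strict;
use warnings;

# Slurp and sort both files (placeholder filenames again).
open my $sfh, '<', 'short.txt' or die "Can't read short.txt: $!";
my @short = sort <$sfh>;
close $sfh;

open my $lfh, '<', 'long.txt' or die "Can't read long.txt: $!";
my @long = sort <$lfh>;
close $lfh;

# Walk the two sorted arrays in parallel (merge-style); each array is
# traversed exactly once and neither index ever moves backward.
my ($i, $j) = (0, 0);
while ($i < @short) {
    if ($j >= @long or $short[$i] lt $long[$j]) {
        print $short[$i];    # sorted order says no match can follow
        $i++;
    }
    elsif ($short[$i] eq $long[$j]) {
        $i++;                # present in both files
    }
    else {
        $j++;                # long-file line with no counterpart so far
    }
}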

A more general approach is to think of each file as a relational 
database table, and figure out what the primary (unique) key is for each 
row (line).  If there is a unique string in each line that is easy to 
extract, great, otherwise something like an MD5 hash value (Digest::MD5) 
for each line can be generated once for the large file (use it as the 
key for a Perl hash) and then compared against the key (MD5 hash) for 
each line of the other file.
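
A minimal sketch of that keyed lookup, once more with placeholder 
filenames (for ~800-line files the raw line could serve as the hash key 
directly; the MD5 digest just keeps the keys a fixed, small size when 
lines are long):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Build the key set once from the large file: one MD5 digest per line.
open my $long_fh, '<', 'long.txt' or die "Can't read long.txt: $!";
my %long_keys;
while (my $line = <$long_fh>) {
    chomp $line;
    $long_keys{ md5_hex($line) } = 1;
}
close $long_fh;

# Single pass over the other file: digest each line and look it up.
open my $short_fh, '<', 'short.txt' or die "Can't read short.txt: $!";
while (my $line = <$short_fh>) {
    chomp $line;
    print "$line\n" unless exists $long_keys{ md5_hex($line) };
}
close $short_fh;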

In short, for the cost of some pre-processing (sorting and/or key 
extraction), you shouldn't have to go through each file more than once.

With Unix you could also try something like this from the command line 
(no Perl)

grep -v -F -f file1 file2

but I imagine it would choke on files over a certain size, or else take 
a very long time.  (On most Unixes, 'grep -F' and 'fgrep' are synonymous.)

--Mark Bole

Michael Paoli wrote:

>A few items to consider.
>There are lots of ways to compare and look at differences among
>files - most notably beyond determining if the entire data contents are
>identical or not.  That's really a topic unto itself.  The source to diff(1)
>might be a useful/interesting place to start looking at that, and/or
>suitable information on various algorithms.
>
>If the size of the files is relatively small compared to the virtual
>memory available, it may be most efficient to have Perl read each file
>entirely into an array; the data can then be handled, compared, etc.
>as desired, without needing to reread the files.
>
>As for repositioning in a file, take a look at Perl's seek function and
>other related functions.  If the files are quite large relative to
>the virtual memory available, this may be a preferable approach.  The
>operating system may also help significantly with caching, so some/many
>logical rereads may not require physical rereading of on-disk data.
>
>I'd guesstimate the more efficient approaches probably avoid rereading the
>files, or portions thereof ... but then there are always the tradeoffs
>between machine efficiency, programmer efficiency, and time, and for
>sufficiently small problem tasks, optimization may not be a significant
>factor.
>
>Quoting "M. Lewis" <cajun at cajuninc.com>:
>
>  
>
>>my $shortfile;
>>my $longfile;
>>my $differences;
>>
>>
>>I'm writing a script to compare two text files ($shortfile & $longfile). 
>>If a line appears in $shortfile, but that line is not in $longfile, then 
>>I want to write that line out to $differences
>>
>>I'm relatively certain it is not efficient to open $longfile for each 
>>entry in $shortfile. Both files are on the order of 800+ lines.
>>
>>For example, a given line in $shortfile is found at line 333 in 
>>$longfile. Without closing and reopening $longfile, I don't know how to 
>>reset the 'pointer' in $longfile back to line 1.
>>
>>Perhaps there is a better way of doing this. I hope I've explained what 
>>I'm trying to do clearly.
>>
>>Suggestions ?
>>    
>>
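
Regarding the seek suggestion in the quoted message above: if the files 
really were too large to hold in memory, the read position of the long 
file can be rewound with seek instead of closing and reopening it.  A 
deliberately simple (and slow, since it rescans the long file for every 
short-file line) sketch, with placeholder filenames:

#!/usr/bin/perl
use strict;
use warnings;

open my $short_fh, '<', 'short.txt' or die "Can't read short.txt: $!";
open my $long_fh,  '<', 'long.txt'  or die "Can't read long.txt: $!";

LINE: while (my $wanted = <$short_fh>) {
    # Rewind the long file to byte 0 rather than reopening it.
    seek $long_fh, 0, 0 or die "Can't seek: $!";
    while (my $line = <$long_fh>) {
        next LINE if $line eq $wanted;   # found a match; try the next line
    }
    print $wanted;                       # never matched: only in the short file
}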

