[oak perl] Comparing two files

M. Lewis cajun at cajuninc.com
Sun May 29 16:16:15 PDT 2005


Thanks Mark and Michael for the thoughts and ideas. I'm not terribly 
concerned about efficiency, as the script isn't going to get used much; 
perhaps only a few times.

Reading each file once and putting its lines into an array sounds like 
the most efficient method. I'll give that some thought.

Thanks again,
Mike

Mark Bole wrote:
> Ditto on the tradeoffs.  For relatively small files, just slurp them 
> both into arrays and use any of several well-documented techniques for 
> comparing them, as suggested.
> 
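That slurp-and-compare suggestion can be sketched in Perl. A minimal sketch, using a hash for O(1) membership tests; the function name, file names, and sample data here are placeholders, not anything from the original script:

```perl
use strict;
use warnings;

# Return the lines of @$short that never appear in @$long.
# Building a hash from the long file makes each membership
# test O(1), so each file is read only once.
sub lines_only_in_first {
    my ($short, $long) = @_;
    my %seen = map { $_ => 1 } @$long;
    return grep { !$seen{$_} } @$short;
}

# Usage: slurp both files once (e.g. my @long = <$fh>;),
# then compare the arrays entirely in memory.
my @short = ("alpha\n", "beta\n", "gamma\n");
my @long  = ("alpha\n", "gamma\n", "delta\n");
print lines_only_in_first(\@short, \@long);   # prints "beta\n"
```

Note that lines are compared verbatim, trailing newline included, so both files should use the same line endings.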
> Pre-sorting each file (array) will eliminate the need to re-seek.  This 
> is the preparation required by the Unix command "comm", which is simpler 
> (but less powerful) than "diff" (or "fc" under Windows).
> 
> A more general approach is to think of each file as a relational 
> database table, and figure out what the primary (unique) key is for each 
> row (line).  If there is a unique string in each line that is easy to 
> extract, great, otherwise something like an MD5 hash value (Digest::MD5) 
> for each line can be generated once for the large file (use it as the 
> key for a Perl hash) and then compared against the key (MD5 hash) for 
> each line of the other file.
> 
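A sketch of that Digest::MD5 idea (the sample records are placeholders; for files of only 800-odd lines, the raw lines would serve equally well as hash keys):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Index the large file by an MD5 digest of each line, then test
# each line of the other file against that index.  Digests keep
# the hash keys short even when individual lines are very long.
my @large = ("first record\n", "second record\n");
my %index = map { md5_hex($_) => 1 } @large;

my @other = ("second record\n", "third record\n");
for my $line (@other) {
    print $line unless $index{ md5_hex($line) };   # prints "third record\n"
}
```

Digest::MD5 ships with the Perl core, so no extra installation is needed.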
> In short, for the cost of some pre-processing (sorting and/or key 
> extraction), you shouldn't have to go through each file more than once.
> 
> With Unix you could also try something like this from the command line 
> (no Perl)
> 
> grep -v -F -f file1 file2
> 
> but I imagine it would choke on files over a certain size, or else take 
> a very long time.  (On most Unixes, 'grep -F' and 'fgrep' are synonymous.)
> 
> --Mark Bole
> 
> Michael Paoli wrote:
> 
>>A few items to consider.
>>There are lots of ways to compare and look at differences among
>>files - most notably beyond determining if the entire data contents are
>>identical or not.  That's really a topic unto itself.  The source to diff(1)
>>might be a useful/interesting place to start looking at that, and/or
>>suitable information on various algorithms.
>>
>>If the size of the files is small relative to the available virtual
>>memory, it may be most efficient to have perl read each file entirely
>>into an array; the data can then be handled, compared, etc. as
>>desired, without needing to reread the files.
>>
>>As for repositioning in a file, take a look at Perl's seek function and
>>other related functions.  If the files are quite large relative to
>>the virtual memory available, this may be the preferable approach.  The
>>operating system may also help significantly with caching, so some or
>>many logical rereads may not require physically rereading on-disk data.
>>
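Rewinding with seek, as mentioned above, lets a script scan a filehandle repeatedly without a close and reopen. A minimal sketch, using the script's own source ($0) as stand-in input:

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Read a filehandle to EOF, then seek back to byte 0 and read it
# again -- no close/reopen needed.
open my $fh, '<', $0 or die "Can't open $0: $!";
my $first  = () = <$fh>;                  # line count, first pass
seek $fh, 0, SEEK_SET or die "seek: $!";  # rewind to the start of the file
my $second = () = <$fh>;                  # line count, second pass
close $fh;
print "passes agree\n" if $first == $second;
```

The `= () =` idiom counts the lines read; the point is simply that both passes see the whole file.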
>>I'd guesstimate the more efficient approaches probably avoid rereading the
>>files, or portions thereof ... but then there are always tradeoffs
>>among machine efficiency, programmer efficiency, and time, and for
>>sufficiently small tasks, optimization may not be a significant
>>factor.
>>
>>Quoting "M. Lewis" <cajun at cajuninc.com>:
>>
>>  
>>
>>>my $shortfile;
>>>my $longfile;
>>>my $differences;
>>>
>>>
>>>I'm writing a script to compare two text files ($shortfile and $longfile). 
>>>If a line appears in $shortfile but is not in $longfile, then 
>>>I want to write that line out to $differences.
>>>
>>>I'm relatively certain it is not efficient to reopen $longfile for each 
>>>entry in $shortfile. Both files are on the order of 800+ lines.
>>>
>>>For example, a given line in $shortfile is found at line 333 in 
>>>$longfile. Without closing and reopening $longfile, I don't know how to 
>>>reset the 'pointer' in $longfile back to line 1.
>>>
>>>Perhaps there is a better way of doing this. I hope I've explained what 
>>>I'm trying to do clearly.
>>>
>>>Suggestions ?
>>>    
>>>
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Oakland mailing list
> Oakland at pm.org
> http://mail.pm.org/mailman/listinfo/oakland

-- 

  A computer program does what you tell it to do, not what you want it 
to do.
  18:14:01 up 4 days, 31 min,  5 users,  load average: 0.25, 0.08, 0.02

  Linux Registered User #241685  http://counter.li.org

