<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">

  <title></title>

</head>

<body text="#000000" bgcolor="#ffffff">

Ditto on the tradeoffs.&nbsp; For relatively small files, just slurp them

both into arrays and use any of several well-documented techniques for

comparing them, as suggested.<br>
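<br>
A minimal sketch of the slurp-and-compare idea, offered as one option rather than the definitive way: load the lines of the longer file into a hash, then test each line of the shorter file against it. The subroutine names and the command-line usage are made up for illustration, and lines are compared exactly after chomp.<br>
<pre>
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines of @$short that do not occur anywhere in @$long.
sub lines_missing_from {
    my ($short, $long) = @_;
    my %seen = map { $_ =&gt; 1 } @$long;   # one pass over the long list
    return grep { !$seen{$_} } @$short;     # one pass over the short list
}

# Slurp a whole file into a list of chomped lines.
sub slurp_lines {
    my ($path) = @_;
    open my $fh, '&lt;', $path or die "Cannot open $path: $!";
    chomp(my @lines = &lt;$fh&gt;);
    close $fh;
    return @lines;
}

# Hypothetical usage: perl missing.pl shortfile longfile &gt; differences
if (@ARGV == 2) {
    my @short = slurp_lines($ARGV[0]);
    my @long  = slurp_lines($ARGV[1]);
    print "$_\n" for lines_missing_from(\@short, \@long);
}
</pre>
Each file is read exactly once, and the hash lookup replaces any rescanning of the longer file.<br>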

<br>

Pre-sorting each file (array) will eliminate the need to re-seek, because the two can then be compared in a single pass.&nbsp; This
is what you do to use the Unix command "comm", which makes it simpler
(but less powerful) than "diff" (or "fc" under Windows).<br>
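<br>
To illustrate the pre-sorting idea, here is a rough sketch of a single-pass walk over two sorted arrays. The subroutine name is hypothetical, and it assumes plain string ordering (Perl's default sort); unlike a line-by-line utility, it treats a line as matched no matter how many times it is repeated in the first list.<br>
<pre>
use strict;
use warnings;

# Sort both lists, then walk them in step and collect the lines
# that appear only in the first list.
sub only_in_first {
    my ($first, $second) = @_;
    my @a = sort @$first;
    my @b = sort @$second;
    my ($i, $j) = (0, 0);
    my @only;
    while ($i &lt; @a) {
        if ($j &gt;= @b or $a[$i] lt $b[$j]) {
            push @only, $a[$i++];   # nothing left in @b can match this line
        }
        elsif ($a[$i] gt $b[$j]) {
            $j++;                   # advance the second list
        }
        else {
            $i++;                   # match: skip it, keep $j for repeated lines
        }
    }
    return @only;
}
</pre>
Neither index ever moves backwards, which is why no re-seeking is needed once the data is sorted.<br>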

<br>

A more general approach is to think of each file as a relational

database table, and figure out what the primary (unique) key is for

each row (line).&nbsp; If there is a unique string in each line that is easy

to extract, great; otherwise something like an MD5 hash value (<tt>Digest::MD5</tt>)

for each line can be generated once for the large file (use it as the

key for a Perl hash) and then compared against the key (MD5 hash) for

each line of the other file.<br>
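<br>
A sketch of the MD5-key variant, under the assumption that each whole line (after chomp) serves as the row's identity and that the lines are byte strings. The subroutine names are invented for illustration; <tt>md5_hex</tt> is the function exported by <tt>Digest::MD5</tt>.<br>
<pre>
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Read the large file once and record the MD5 digest of every line.
sub digest_index {
    my ($fh) = @_;
    my %index;
    while (my $line = &lt;$fh&gt;) {
        chomp $line;
        $index{ md5_hex($line) } = 1;
    }
    return \%index;
}

# Read the other file once and return the lines whose digest is not indexed.
sub lines_not_indexed {
    my ($fh, $index) = @_;
    my @missing;
    while (my $line = &lt;$fh&gt;) {
        chomp $line;
        push @missing, $line unless $index-&gt;{ md5_hex($line) };
    }
    return @missing;
}

# Hypothetical usage with two already-opened filehandles:
#   my $index = digest_index($long_fh);
#   print "$_\n" for lines_not_indexed($short_fh, $index);
</pre>
The digest keys are a fixed 32 characters each, so this mainly pays off when the lines are long; for short lines, the line itself works just as well as the hash key.<br>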

<br>

In short, for the cost of some pre-processing (sorting and/or key

extraction), you shouldn't have to go through each file more than once.<br>

<br>

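With Unix you could also try something like this from the command line
(no Perl), which prints the lines of file2 that do not appear in file1
(the -x restricts matches to whole lines rather than substrings):<br>
<br>
<tt>grep -v -x -F -f file1 file2</tt><br>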

<br>

but I imagine it would choke on files over a certain size, or else take
a very long time. (On most Unixes, 'grep -F' and 'fgrep' are
synonymous.)<br>

<br>

--Mark Bole<br>

<br>

Michael Paoli wrote:<br>

<blockquote type="cite"

 cite="mid1117377554.4299d412d9d0e@webmail.rawbw.com">

  <pre wrap="">A few items to consider.

There are lots of ways to compare and look at differences among

files - most notably beyond determining if the entire data contents are

identical or not.  That's really a topic unto itself.  The source to diff(1)

might be a useful/interesting place to start looking at that, and/or

suitable information on various algorithms.

If the size of the files is relatively small compared to the virtual

memory available, it may be most/quite efficient to have perl read each

of the entire files into arrays, and one can then handle, compare, etc.

that data as desired, without need to reread the files.

As for repositioning in a file, take a look at the seek perl function, and

other related perl functions.  If the files are quite large relative to

the virtual memory available, this may be a preferable approach.  The

operating system may also help significantly with caching, so some/many

logical rereads may not require physical rereading of on-disk data.

I'd guestimate the more efficient approaches probably avoid rereading the

files, or portions thereof ... but then there are always the tradeoffs

between machine efficiency, programmer efficiency, and time, and for

sufficiently small problem tasks, optimization may not be a significant

factor.

Quoting "M. Lewis" <a class="moz-txt-link-rfc2396E" href="mailto:cajun@cajuninc.com">&lt;cajun@cajuninc.com&gt;</a>:

  </pre>

  <blockquote type="cite">

    <pre wrap="">my $shortfile;

my $longfile;

my $differences;

I'm writing a script to compare two text files ($shortfile &amp; $longfile). 

If a line appears in $shortfile, but that line is not in $longfile, then 

I want to write that line out to $differences

I'm relatively certain it is not efficient to open $longfile for each 

entry in $shortfile. Both files are of the magnitude of 800+ lines.

For example, a given line in $shortfile is found at line 333 in 

$longfile. Without closing and reopening $longfile, I don't know how to 

reset the 'pointer' in $longfile back to line 1.

Perhaps there is a better way of doing this. I hope I've explained what 

I'm trying to do clearly.

Suggestions ?

    </pre>

  </blockquote>


</blockquote>

<br>


</body>

</html>