<br>Thank you both for taking the time!<br><br>Working with a smaller test dump is classic divide-and-conquer. I should&#39;ve thought of<br>this, but I was under pressure to slam out a script in half a day.<br><br>I will ponder the techniques you suggested, Tony.<br>

<br>Thank you again.  A good word goes a long way.<br><br>Chris<br><br><div class="gmail_quote">On Thu, Nov 11, 2010 at 1:26 AM, Anthony Foiani <span dir="ltr">&lt;<a href="mailto:tkil@scrye.com">tkil@scrye.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><br>

Christopher --<br>

<div class="im"><br>

Christopher Hahn &lt;<a href="mailto:xrz1138@gmail.com">xrz1138@gmail.com</a>&gt; writes:<br>

&gt; I am trying to parse a huge (7 Gb) file that is line oriented but<br>

&gt; has large sections that are any kind of binary character.<br>

</div>&gt; [...]<br>

<div class="im">&gt; I am sure that there are odd chars in the file that are doing this....<br>

<br>

</div>That&#39;s really unlikely.  What makes you sure?<br>

<br>

You could try doing a simple replace of all non-printable chars with,<br>

say &quot;!&quot;, and see if it still chokes.<br>
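<br>
A minimal sketch of that replacement, assuming "printable" means ASCII<br>
0x20-0x7E plus tab and newline; the helper name mask_nonprintable is<br>
made up for illustration:<br>

```perl
use strict;
use warnings;

# Hypothetical helper: replace every byte that is not tab, newline, or
# printable ASCII with "!". The /c flag complements the search list, and
# the single replacement character is reused for every match.
sub mask_nonprintable {
    my ($data) = @_;
    (my $masked = $data) =~ tr/\x09\x0A\x20-\x7E/!/c;
    return $masked;
}

# Optional filter mode: mask the file named on the command line.
if (@ARGV) {
    open my $in, '<:raw', $ARGV[0] or die "open $ARGV[0]: $!";
    binmode STDOUT;
    local $/ = \65536;    # read fixed 64 KiB records instead of lines
    while (my $chunk = <$in>) {
        print mask_nonprintable($chunk);
    }
    close $in;
}
```

If the script still dies on the masked copy, the odd characters are<br>
probably not the cause.<br>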

<div class="im"><br>

&gt; I tried setting binmode on the input file handle, and just loading<br>

&gt; the entire file into a buffer, just as a test, as we have enough<br>

&gt; memory to do this.<br>

<br>

</div>And for 7GB, you really want to use &#39;mmap&#39;, not &quot;read the buffer in&quot;.<br>

There are Perl modules to do that, although I&#39;ve never used them.<br>

<br>

Oooooh!<br>

<br>

  &quot;Note that PerlIO now defines a :mmap tag and presents mmap&#39;d files<br>

  as regular files, if that is your cup of joe.&quot;<br>

    -- <a href="http://search.cpan.org/%7Etoddr/Sys-Mmap-0.14/Mmap.pm" target="_blank">http://search.cpan.org/~toddr/Sys-Mmap-0.14/Mmap.pm</a><br>
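<br>
An untested-in-anger sketch of what that might look like, assuming a<br>
perl whose PerlIO was built with mmap support; the helper name<br>
count_lines_mmap is hypothetical:<br>

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of a file opened through the
# PerlIO :mmap layer, so the kernel pages the file in on demand rather
# than perl copying it all into one scalar.
sub count_lines_mmap {
    my ($path) = @_;
    open my $fh, '<:mmap', $path or die "open $path: $!";
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    close $fh;
    return $n;
}
```

The filehandle behaves like any other, so existing line-by-line code<br>
should not need to change beyond the open call.<br>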

<div class="im"><br>

&gt; I watched using &quot;top&quot; and after the memory used climbed to a tad<br>

&gt; more than the size of the file, the &quot;Killed&quot; message appeared and<br>

&gt; the console closed itself.<br>

<br>

</div>Just to check, did you turn off the relevant limits?  &quot;ulimit -a&quot; will show them.<br>

<br>

What kind of system is this?  If it is a non-ancient Linux system,<br>

there is the &quot;Out Of Memory Killer&quot; (&quot;OOM Killer&quot;) to contend with.<br>

Looking through the logs should point fingers if that&#39;s the case.<br>

<br>

That&#39;s pretty much the only thing I can think of that would kill<br>

your console session, actually; normally perl itself would just give<br>

up and/or choke, and return control to your tty.<br>

<div class="im"><br>

&gt; I have to stay at work until this is done, and so am just hoping that<br>

&gt; someone is online and can give me the kick in the head that I need.<br>

<br>

</div>You can also use &quot;strace&quot; on linux (and &quot;truss&quot; on solaris and bsd?);<br>

that might tell you which system call actually failed.  You probably<br>

want to send output to a file, since having it sent to a terminal that<br>

is disconnected is not horribly useful.<br>

<br>

  strace -o boom.txt -f -tt ./my-perl-script-here.plx ...<br>

<br>

Finally, another way to deal with arbitrarily large but still<br>

&quot;line-oriented&quot; strings is to read in chunks (say, 4096 bytes), look<br>

for any complete structures, and then keep the &quot;tail&quot; from the<br>

previous blocks, append the next 4096 bytes, etc:<br>

<br>

    # what separates records?<br>

    my $sep_re = qr/xyzzy/;<br>

<br>

    # how much to read at a time?<br>

    my $chunk_size = 1 &lt;&lt; 16;<br>

<br>

    # read in chunks at a time, split into records, continue.<br>

    my $buf = &#39;&#39;;  # start empty so &quot;length $buf&quot; is defined on the first read<br>

    while ( my $n_read = read IN, $buf, $chunk_size, length $buf )<br>

    {<br>

        # split into records, preserving any leading/trailing nulls<br>

        my @recs = split $sep_re, $buf, -1;<br>

<br>

        # store any trailer into the buffer for next read<br>

        $buf = pop @recs;<br>

<br>

        # now process the records that have been found<br>

        foreach my $rec ( @recs )<br>

        {<br>

            transmogrify $rec;<br>

        }<br>

    }<br>

<br>

    # handle end-of-file stragglers<br>

    if ( length( $buf ) )<br>

    {<br>

        transmogrify $buf;<br>

    }<br>
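<br>
For something that runs as-is, here is the same idea wrapped in a sub,<br>
as a sketch only: the name each_record, the callback interface, and the<br>
in-memory demo data are all assumptions for illustration. It also<br>
tells a failed read (undef) apart from end of file (0):<br>

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per record in $fh, where
# records are separated by $sep_re, reading $chunk_size bytes at a time.
sub each_record {
    my ($fh, $sep_re, $callback, $chunk_size) = @_;
    $chunk_size ||= 1 << 16;
    my $buf = '';
    while (1) {
        my $n_read = read $fh, $buf, $chunk_size, length $buf;
        die "read failed: $!" unless defined $n_read;
        last if $n_read == 0;                 # end of file

        my @recs = split $sep_re, $buf, -1;   # -1 keeps trailing empty fields
        $buf = pop @recs;                     # possibly incomplete tail
        $callback->($_) for @recs;
    }
    $callback->($buf) if length $buf;         # end-of-file straggler
    return;
}

# Demo on an in-memory file, with a deliberately tiny chunk size so that
# records straddle chunk boundaries.
my $data = "alpha\nbeta\ngamma";
open my $fh, '<', \$data or die "open: $!";
my @seen;
each_record($fh, qr/\n/, sub { push @seen, $_[0] }, 4);
print join(',', @seen), "\n";    # prints "alpha,beta,gamma"
```

A fixed-string separator that straddles a chunk boundary is handled,<br>
because the partial separator stays in the tail until the next read<br>
completes it. A variable-length pattern could still split early at a<br>
boundary, so treat that case with care.<br>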

<br>

I&#39;ve got a bit of an obscure example that does this; take a look<br>

around line 547 of perl-chat:<br>

<br>

  <a href="http://foiani.com/perl/examples/perl-chat" target="_blank">http://foiani.com/perl/examples/perl-chat</a><br>

<br>

    # and anything we had left over from last time<br>

    my $buf = $read_buf{$sock};<br>

<br>

    my $n_bytes_read = sysread($sock, $buf, SOCK_IO_SIZE, length $buf);<br>

    if (not defined $n_bytes_read)<br>

    {<br>

      print STDERR &quot;sysread returned error: $!&quot;;<br>

      $socket_info{$sock}{state} = SS_CLOSE_IMMED;<br>

    }<br>

    elsif ($n_bytes_read == 0)<br>

    {<br>

      # if we didn&#39;t read any more bytes, then we probably want to remove it.<br>

      $socket_info{$sock}{state} = SS_CLOSE_PENDING;<br>

    }<br>

    else<br>

    {<br>

      # split the buffer on newlines...<br>

      my @lines = split /\n/, $buf, -1;<br>

      # ... saving any stragglers for the next time around<br>

      $buf = pop @lines;<br>

<br>

      # and then process each line.<br>

      foreach (@lines)<br>

      {<br>

        handle_line($sock, $_);<br>

        print STDERR &quot;r&quot; if $config{DEBUG};<br>

      }<br>

    }<br>

<br>

    # store the remnants for the next time around.<br>

    $read_buf{$sock} = $buf;<br>

<br>

I can expand on this technique if that [ancient!] code isn&#39;t<br>

particularly enlightening.  :)<br>

<div class="im"><br>

&gt; In any case, thanks for the attention,<br>

<br>

</div>Hope it helps.  :)<br>

<br>

Happy hacking,<br>

t.<br>

</blockquote></div><br><br clear="all"><br>-- <br>Realisant mon espoir, je me lance vers la gloire.<br>Christopher Hahn == <a href="mailto:xrz1138@gmail.com">xrz1138@gmail.com</a><br>