[San-Diego-pm] odd chars in file "Killing" my console

Anthony Foiani tkil at scrye.com
Thu Nov 11 01:26:04 PST 2010


Christopher --

Christopher Hahn <xrz1138 at gmail.com> writes:
> I am trying to parse a huge (7 Gb) file that is line oriented but
> has large sections that are any kind of binary character.
> [...]
> I am sure that there are odd chars in the file that are doing this....

That's really unlikely.  What makes you sure?

You could try replacing every non-printable character with, say, "!"
and see whether it still chokes.
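
Here is a minimal sketch of that replacement.  It assumes "printable"
means ASCII 0x20-0x7E plus tab and newline; the sub name
mask_nonprintable is just a placeholder of mine, not anything from
your script:

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable ASCII
# (0x20-0x7E), tab, and newline with "!".
sub mask_nonprintable {
    my ($line) = @_;

    # /c complements the search list, so everything NOT listed is
    # translated; the single "!" is reused for all of those characters.
    $line =~ tr/\x20-\x7E\t\n/!/c;

    return $line;
}

# Small demonstration on a string with a NUL and a high byte in it.
print mask_nonprintable("abc\x00def\xFFghi\n");
```

If the masked copy of the file parses without trouble, the odd
characters are a suspect after all; if it still dies, look elsewhere.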

> I tried setting binmode on the input file handle, and just loading
> the entire file into a buffer, just as a test, as we have enough
> memory to do this.

For 7GB, you really want to mmap the file rather than read the whole
thing into a buffer.  There are Perl modules to do that, although I've
never used them.

Oooooh!

  "Note that PerlIO now defines a :mmap tag and presents mmap'd files
  as regular files, if that is your cup of joe."
    -- http://search.cpan.org/~toddr/Sys-Mmap-0.14/Mmap.pm
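
A rough sketch of what that might look like, going by the PerlIO
documentation rather than experience (as I said, I haven't used this
myself).  The sub name count_lines_mmap is made up for illustration,
and I'm assuming a perl and platform where the :mmap layer is
available and can address a file that large:

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of a file opened through the
# PerlIO :mmap layer.  The handle is then read like any other file.
sub count_lines_mmap {
    my ($path) = @_;

    open my $fh, '<:mmap', $path
        or die "cannot open $path: $!";

    my $n = 0;
    while ( defined( my $line = <$fh> ) ) {
        $n++;
    }

    close $fh;
    return $n;
}
```

The point is that only the open call changes; the rest of the parsing
loop can stay line-oriented.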

> I watched using "top" and after the memory used climbed to a tad
> more than the size of the file, the "Killed" message appeared and
> the console closed itself.

Just to check, did you raise or remove the relevant limits?
"ulimit -a" will show what is currently in effect.

What kind of system is this?  If it is a non-ancient Linux system,
there is the "Out Of Memory Killer" ("OOM Killer") to contend with.
Looking through the logs should point fingers if that's the case.
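
If you'd rather not eyeball the logs, something like this sketch can
pick out the interesting lines.  The patterns are my assumption about
what the kernel messages typically look like, and where your system
keeps its kernel log (and whether you can read it) is something you
will have to check; oom_lines is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that look like OOM-killer
# activity.  The patterns are assumptions about typical Linux kernel
# messages, not an exhaustive list.
sub oom_lines {
    my (@lines) = @_;
    return grep { /out of memory|oom[-_]killer|killed process/i } @lines;
}

# Demonstration on made-up log lines.
my @sample = (
    "kernel: Out of memory: Kill process 1234 (perl) score 900\n",
    "sshd[99]: session opened for user chris\n",
    "kernel: Killed process 1234 (perl)\n",
);
print oom_lines(@sample);
```

To use it for real, read the lines from your kernel log file (or from
the output of dmesg) and pass them in.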

That's pretty much the only thing I can think of that would kill
your console session, actually; normally perl itself would just give
up and/or choke, and return control to your tty.

> I have to stay at work until this is done, and so am just hoping that
> someone is online and can give me the kick in the head that I need.

You can also use "strace" on Linux (and "truss" on Solaris and BSD?);
that might tell you which system call actually failed.  You probably
want to send the output to a file, since sending it to a terminal that
has been disconnected is not horribly useful.

  strace -o boom.txt -f -tt ./my-perl-script-here.plx ...
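
The trace file can get long, so here is a sketch of a filter that
pulls out the calls that returned -1.  It assumes the usual strace
line shape of "name(args) = -1 ERRNO (description)", optionally
preceded by the pid and timestamp that -f and -tt add; the sub name
failed_syscalls is hypothetical.  Expect some harmless noise (for
instance ENOENT while perl searches for modules):

```perl
use strict;
use warnings;

# Hypothetical helper: given lines of strace output, return a list of
# [ syscall name, errno name ] pairs for calls that returned -1.
sub failed_syscalls {
    my (@lines) = @_;
    my @failed;

    for my $line (@lines) {
        # assumed shape: "1234 01:26:04.123456 mmap(NULL, ...) = -1 ENOMEM (...)"
        if ( $line =~ /\b(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z0-9]+)/ ) {
            push @failed, [ $1, $2 ];
        }
    }

    return @failed;
}

# Demonstration on made-up trace lines.
my @sample = (
    "1234 01:26:04.000001 read(3, \"abc\", 4096) = 3\n",
    "1234 01:26:04.000002 mmap(NULL, 8192, PROT_READ, MAP_PRIVATE, -1, 0)"
        . " = -1 ENOMEM (Cannot allocate memory)\n",
);
printf "%s failed with %s\n", @$_ for failed_syscalls(@sample);
```

Feed it the lines of boom.txt and look at the last few failures
before the process died.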

Finally, another way to deal with arbitrarily large but still
"line-oriented" data is to read it a chunk at a time (say, 4096
bytes), process any complete records in the buffer, keep the
incomplete "tail", append the next chunk to that tail, and repeat:

    # what separates records?
    my $sep_re = qr/xyzzy/;

    # how much to read at a time?
    my $chunk_size = 1 << 16;

    # read in chunks at a time, split into records, continue.
    # start with an empty (not undef) buffer so "length $buf" is 0.
    # note: read returns undef on error, which also ends this loop.
    my $buf = '';
    while ( my $n_read = read IN, $buf, $chunk_size, length $buf )
    {
        # split into records; the -1 limit preserves trailing empty fields
        my @recs = split $sep_re, $buf, -1;

        # store any trailer into the buffer for next read
        $buf = pop @recs;

        # now process the records that have been found
        foreach my $rec ( @recs )
        {
            transmogrify( $rec );
        }
    }

    # handle end-of-file stragglers
    if ( length( $buf ) )
    {
        transmogrify( $buf );
    }
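
To try the idea without the 7GB file, here is a self-contained sketch
of the same loop wrapped in a sub and run against an in-memory
filehandle.  The name each_record, the callback interface, and the
tiny chunk size are my choices for the demo, not anything your script
needs to copy:

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in $chunk_size pieces, split on
# $sep_re, and call $callback once per complete record.
sub each_record {
    my ( $fh, $sep_re, $chunk_size, $callback ) = @_;

    my $buf = '';
    while (1) {
        # append the next chunk after whatever tail is left in $buf
        my $n_read = read $fh, $buf, $chunk_size, length $buf;
        die "read failed: $!" unless defined $n_read;
        last if $n_read == 0;    # end of file

        # everything before the last separator is a complete record
        my @recs = split $sep_re, $buf, -1;

        # the last piece may be incomplete; keep it for the next read
        $buf = pop @recs;

        $callback->($_) for @recs;
    }

    # whatever is left at end-of-file is the final record
    $callback->($buf) if length $buf;
    return;
}

# Demo: a 4-byte chunk size forces separators to straddle reads.
my $data = "alphaxyzzybetaxyzzygamma";
open my $in, '<', \$data or die "cannot open in-memory file: $!";
binmode $in;

my @seen;
each_record( $in, qr/xyzzy/, 4, sub { push @seen, $_[0] } );
print join( ',', @seen ), "\n";    # prints "alpha,beta,gamma"
```

A separator that is split across two reads simply stays in the tail
until the next chunk completes it, which is why the tail has to be
kept rather than processed right away.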

I've got a bit of an obscure example that does this; take a look
around line 547 of perl-chat:

  http://foiani.com/perl/examples/perl-chat

    # and anything we had left over from last time
    my $buf = $read_buf{$sock};

    my $n_bytes_read = sysread($sock, $buf, SOCK_IO_SIZE, length $buf);
    if (not defined $n_bytes_read)
    {
      print STDERR "sysread returned error: $!";
      $socket_info{$sock}{state} = SS_CLOSE_IMMED;
    }
    elsif ($n_bytes_read == 0)
    {
      # if we didn't read any more bytes, then we probably want to remove it.
      $socket_info{$sock}{state} = SS_CLOSE_PENDING;
    }
    else
    {
      # split the buffer on newlines...
      my @lines = split /\n/, $buf, -1;
      # ... saving any stragglers for the next time around
      $buf = pop @lines;

      # and then process each line.
      foreach (@lines)
      {
        handle_line($sock, $_);
        print STDERR "r" if $config{DEBUG};
      }
    }

    # store the remnants for the next time around.
    $read_buf{$sock} = $buf;

I can expand on this technique if that [ancient!] code isn't
particularly enlightening.  :)

> In any case, thanks for the attention,

Hope it helps.  :)

Happy hacking,
t.

