[San-Diego-pm] odd chars in file "Killing" my console

Thu Nov 11 01:33:36 PST 2010

Thank you both for taking the time!

Testing with a smaller dump is classic divide-and-conquer... I should
have thought of that, but I was under pressure to slam out a script in
half a day.

I will ponder the techniques you suggested, Tony.

Thank you again.  A good word goes a long way.

Chris

On Thu, Nov 11, 2010 at 1:26 AM, Anthony Foiani <tkil at scrye.com> wrote:

>
> Christopher --
>
> Christopher Hahn <xrz1138 at gmail.com> writes:
> > I am trying to parse a huge (7 Gb) file that is line oriented but
> > has large sections that are any kind of binary character.
> > [...]
> > I am sure that there are odd chars in the file that are doing this....
>
> That's really unlikely.  What makes you sure?
>
> You could try doing a simple replace of all non-printable chars with,
> say "!", and see if it still chokes.
>
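A minimal sketch of that replacement test, assuming "printable" means
ASCII 0x20 through 0x7E plus tab and newline. The helper name
mask_nonprintable is made up for illustration; it is not from the
original thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text in which every byte that is
# NOT tab, newline, or printable ASCII has been replaced by "!".
sub mask_nonprintable {
    my ($text) = @_;
    # The /c flag complements the search list, so everything outside
    # \t, \n, and 0x20-0x7E is translated to "!".
    ( my $copy = $text ) =~ tr/\t\n\x20-\x7E/!/c;
    return $copy;
}

# NUL, BEL, and ESC are masked; the rest passes through untouched.
print mask_nonprintable("ok\x00\x07\x1b[31m line\n");   # prints: ok!!![31m line
```

If the masked output displays cleanly, the raw control bytes become a
more plausible suspect for terminal trouble; if the job still dies, they
are probably not the cause.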
> > I tried setting binmode on the input file handle, and just loading
> > the entire file into a buffer, just as a test, as we have enough
> > memory to do this.
>
> And for 7GB, you really want to use 'mmap', not "read the buffer in".
> There are Perl modules to do that, although I've never used them.
>
> Oooooh!
>
>  "Note that PerlIO now defines a :mmap tag and presents mmap'd files
>  as regular files, if that is your cup of joe."
>    -- http://search.cpan.org/~toddr/Sys-Mmap-0.14/Mmap.pm
>
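A minimal sketch of the PerlIO route, assuming a perl whose PerlIO
supports the :mmap layer and a 64-bit build (a 32-bit address space is
too small to map a 7 GB file). The helper name count_lines_mmap is
hypothetical, and the file path is whatever you pass on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines in $path, reading through the
# PerlIO :mmap layer instead of copying the file into a scalar.
sub count_lines_mmap {
    my ($path) = @_;
    open( my $fh, '<:mmap', $path )
        or die "cannot open $path: $!";
    my $count = 0;
    $count++ while <$fh>;    # ordinary line reads; the layer does the mapping
    close $fh;
    return $count;
}

# Only runs when a file name is given on the command line.
printf "%d lines\n", count_lines_mmap( $ARGV[0] ) if @ARGV;
```

The script still sees a regular filehandle, so the rest of the parsing
code would not need to change.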
> > I watched using "top" and after the memory used climbed to a tad
> > more than the size of the file, the "Killed" message appeared and
> > the console closed itself.
>
> Just to check, did you turn off the relevant limits?  "ulimit -a"
>
> What kind of system is this?  If it is a non-ancient Linux system,
> there is the "Out Of Memory Killer" ("OOM Killer") to contend with.
> Looking through the logs should point fingers if that's the case.
>
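A rough sketch of scanning a log for OOM Killer activity. The exact
kernel message wording and the log location vary by kernel version and
distribution, so the pattern below ("out of memory" or "killed process",
case-insensitive) is an assumption, as is the helper name oom_lines.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that look like OOM Killer messages.
# The pattern is an assumption about typical kernel log wording.
sub oom_lines {
    my @lines = @_;
    return grep { /out of memory|killed process/i } @lines;
}

# Only runs when a log file name is given on the command line.
if (@ARGV) {
    open( my $log, '<', $ARGV[0] ) or die "cannot open $ARGV[0]: $!";
    my @hits = oom_lines(<$log>);
    close $log;
    print @hits ? @hits : "no OOM-looking lines found\n";
}
```

Point it at whichever file your system writes kernel messages to, or at
saved "dmesg" output.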
> That's pretty much the only thing I can think about that would kill
> your console session, actually; normally perl itself would just give
> up and/or choke, and return control to your tty.
>
> > I have to stay at work until this is done, and so am just hoping that
> > someone is online and can give me the kick in the head that I need.
>
> You can also use "strace" on linux (and "truss" on solaris and bsd?);
> that might tell you which system call actually failed.  You probably
> want to send output to a file, since having it sent to a terminal that
> is disconnected is not horribly useful.
>
>  strace -o boom.txt -f -tt ./my-perl-script-here.plx ...
>
> Finally, another way to deal with arbitrarily large but still
> "line-oriented" strings is to read in chunks (say, 4096 bytes), look
> for any complete structures, and then keep the "tail" from the
> previous blocks, append the next 4096 bytes, etc:
>
>    # what separates records?
>    my $sep_re = qr/xyzzy/;
>
>    # how much to read at a time?
>    my $chunk_size = 1 << 16;
>
>    # read in chunks at a time, split into records, continue.
>    # ($buf starts as '' so "length $buf" is defined on the first read.)
>    my $buf = '';
>    while ( my $n_read = read IN, $buf, $chunk_size, length $buf )
>    {
>        # split into records; the -1 limit keeps trailing empty fields
>        my @recs = split $sep_re, $buf, -1;
>
>        # store any trailer into the buffer for next read
>        $buf = pop @recs;
>
>        # now process the records that have been found
>        foreach my $rec ( @recs )
>        {
>            transmogrify $rec;
>        }
>    }
>
>    # handle end-of-file stragglers
>    if ( length( $buf ) )
>    {
>        transmogrify $buf;
>    }
>
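To see the chunked loop above behave on a separator that straddles a
chunk boundary, here is a small self-contained sketch. It reads from an
in-memory filehandle with a deliberately tiny chunk size; the helper name
read_records, the "--" separator, and the sample data are all made up for
illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh in $chunk_size pieces and return the
# records separated by $sep_re, carrying any partial tail between reads.
sub read_records {
    my ( $fh, $sep_re, $chunk_size ) = @_;
    my @out;
    my $buf = '';
    while ( read( $fh, $buf, $chunk_size, length $buf ) ) {
        # -1 keeps a trailing empty field, so the last element is always
        # the (possibly empty) incomplete tail
        my @recs = split $sep_re, $buf, -1;
        $buf = pop @recs;
        push @out, @recs;
    }
    # handle end-of-file stragglers
    push @out, $buf if length $buf;
    return @out;
}

my $data = "alpha--beta--gamma--delta";
open( my $fh, '<', \$data ) or die "cannot open in-memory handle: $!";
binmode $fh;

# A 4-byte chunk forces the "--" between beta and gamma across two reads.
my @records = read_records( $fh, qr/--/, 4 );
print join( "|", @records ), "\n";    # prints: alpha|beta|gamma|delta
```

A partial separator stays in the tail until the next read completes it.
That holds for a fixed-string separator like this one; a variable-length
pattern could match early on a chunk boundary and would need more care.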
> I've got a bit of an obscure example that does this; take a look
> around line 547 of perl-chat:
>
>  http://foiani.com/perl/examples/perl-chat
>
>    # and anything we had left over from last time
>    my $buf = $read_buf{$sock};
>
>    my $n_bytes_read = sysread($sock, $buf, SOCK_IO_SIZE, length $buf);
>    if (not defined $n_bytes_read)
>    {
>      print STDERR "sysread returned error: $!";
>      $socket_info{$sock}{state} = SS_CLOSE_IMMED;
>    }
>    elsif ($n_bytes_read == 0)
>    {
>      # if we didn't read any more bytes, then we probably want to remove it.
>      $socket_info{$sock}{state} = SS_CLOSE_PENDING;
>    }
>    else
>    {
>      # split the buffer on newlines...
>      my @lines = split /\n/, $buf, -1;
>      # ... saving any stragglers for the next time around
>      $buf = pop @lines;
>
>      # and then process each line.
>      foreach (@lines)
>      {
>        handle_line($sock, $_);
>        print STDERR "r" if $config{DEBUG};
>      }
>    }
>
>    # store the remnants for the next time around.
>    $read_buf{$sock} = $buf;
>
> I can expand on this technique if that [ancient!] code isn't
> particularly enlightening.  :)
>
> > In any case, thanks for the attention,
>
> Hope it helps.  :)
>
> Happy hacking,
> t.
>

-- 
Realisant mon espoir, je me lance vers la gloire.
Christopher Hahn == xrz1138 at gmail.com