[San-Diego-pm] odd chars in file "Killing" my console
Christopher Hahn
xrz1138 at gmail.com
Thu Nov 11 01:33:36 PST 2010
Thank you both for taking the time!
A smaller test dump is divide-and-conquer...I should've thought of
this, but I was challenged with trying to slam out a script in a half-day.
I will ponder the techniques you suggested, Tony.
Thank you again. A good word goes a long way.
Chris
On Thu, Nov 11, 2010 at 1:26 AM, Anthony Foiani <tkil at scrye.com> wrote:
>
> Christopher --
>
> Christopher Hahn <xrz1138 at gmail.com> writes:
> > I am trying to parse a huge (7 Gb) file that is line oriented but
> > has large sections that are any kind of binary character.
> > [...]
> > I am sure that there are odd chars in the file that are doing this....
>
> That's really unlikely. What makes you sure?
>
> You could try doing a simple replace of all non-printable chars with,
> say "!", and see if it still chokes.
>
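The replace-with-"!" test suggested above could be sketched roughly as follows. This is only an illustration: the helper name mask_nonprintable is hypothetical, and treating everything outside printable ASCII (apart from tab and newline) as "non-printable" is an assumption about what counts as an odd character in this file.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the input with every character
# outside printable ASCII (space through tilde) replaced by "!".
# Tabs and newlines are left alone so the line structure survives.
sub mask_nonprintable {
    my ($line) = @_;
    (my $clean = $line) =~ s/[^\x20-\x7E\t\n]/!/g;
    return $clean;
}

# Small demonstration on a string containing a NUL and a BEL byte.
print mask_nonprintable("ok\x00\x07done\n");   # prints "ok!!done"
```

Running each input line through such a filter before the real parsing code sees it would show whether the odd characters are actually the trigger; if the process is still killed, the cause lies elsewhere.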
> > I tried setting binmode on the input file handle, and just loading
> > the entire file into a buffer, just as a test, as we have enough
> > memory to do this.
>
> And for 7GB, you really want to use 'mmap', not "read the buffer in".
> There are Perl modules to do that, although I've never used them.
>
> Oooooh!
>
> "Note that PerlIO now defines a :mmap tag and presents mmap'd files
> as regular files, if that is your cup of joe."
> -- http://search.cpan.org/~toddr/Sys-Mmap-0.14/Mmap.pm
>
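As a rough sketch of what the quoted note describes, the :mmap PerlIO layer can be named in the open mode so that the file is read through a memory mapping instead of being slurped into a buffer. The function name count_lines_mmap is hypothetical, and this assumes a perl built on a platform where the :mmap layer is available.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of a file, reading it through
# the :mmap PerlIO layer rather than loading it all into one scalar.
sub count_lines_mmap {
    my ($path) = @_;
    open my $fh, '<:mmap', $path
        or die "cannot open $path: $!";

    my $count = 0;
    while ( defined( my $line = <$fh> ) ) {
        $count++;
    }

    close $fh;
    return $count;
}
```

Only one line is held in a Perl scalar at a time here, so memory use is driven by the longest "line" rather than by the size of the file. With large binary sections a single line can still be very big.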
> > I watched using "top" and after the memory used climbed to a tad
> > more than the size of the file, the "Killed" message appeared and
> > the console closed itself.
>
> Just to check, did you turn off the relevant limits? "ulimit -a"
>
> What kind of system is this? If it is a non-ancient Linux system,
> there is the "Out Of Memory Killer" ("OOM Killer") to contend with.
> Looking through the logs should point fingers if that's the case.
>
> That's pretty much the only thing I can think about that would kill
> your console session, actually; normally perl itself would just give
> up and/or choke, and return control to your tty.
>
> > I have to stay at work until this is done, and so am just hoping that
> > someone is online and can give me the kick in the head that I need.
>
> You can also use "strace" on linux (and "truss" on solaris and bsd?);
> that might tell you which system call actually failed. You probably
> want to send output to a file, since having it sent to a terminal that
> is disconnected is not horribly useful.
>
> strace -o boom.txt -f -tt ./my-perl-script-here.plx ...
>
> Finally, another way to deal with arbitrarily large but still
> "line-oriented" strings is to read in chunks (say, 4096 bytes), look
> for any complete structures, and then keep the "tail" from the
> previous blocks, append the next 4096 bytes, etc:
>
>   # what separates records?
>   my $sep_re = qr/xyzzy/;
>
>   # how much to read at a time?
>   my $chunk_size = 1 << 16;
>
>   # read in chunks at a time, split into records, continue.
>   # start with an empty (not undef) buffer so "length $buf" is a valid offset
>   my $buf = '';
>   while ( my $n_read = read IN, $buf, $chunk_size, length $buf )
>   {
>       # split into records, preserving any leading/trailing empty fields
>       my @recs = split $sep_re, $buf, -1;
>
>       # store any trailer into the buffer for next read
>       $buf = pop @recs;
>
>       # now process the records that have been found
>       foreach my $rec ( @recs )
>       {
>           transmogrify( $rec );
>       }
>   }
>
>   # handle end-of-file stragglers
>   if ( length( $buf ) )
>   {
>       transmogrify( $buf );
>   }
>
> I've got a bit of an obscure example that does this; take a look
> around line 547 of perl-chat:
>
> http://foiani.com/perl/examples/perl-chat
>
>   # and anything we had left over from last time
>   my $buf = $read_buf{$sock};
>
>   my $n_bytes_read = sysread($sock, $buf, SOCK_IO_SIZE, length $buf);
>   if (not defined $n_bytes_read)
>   {
>       print STDERR "sysread returned error: $!";
>       $socket_info{$sock}{state} = SS_CLOSE_IMMED;
>   }
>   elsif ($n_bytes_read == 0)
>   {
>       # if we didn't read any more bytes, then we probably want to remove it.
>       $socket_info{$sock}{state} = SS_CLOSE_PENDING;
>   }
>   else
>   {
>       # split the buffer on newlines...
>       my @lines = split /\n/, $buf, -1;
>       # ... saving any stragglers for the next time around
>       $buf = pop @lines;
>
>       # and then process each line.
>       foreach (@lines)
>       {
>           handle_line($sock, $_);
>           print STDERR "r" if $config{DEBUG};
>       }
>   }
>
>   # store the remnants for the next time around.
>   $read_buf{$sock} = $buf;
>
> I can expand on this technique if that [ancient!] code isn't
> particularly enlightening. :)
>
> > In any case, thanks for the attention,
>
> Hope it helps. :)
>
> Happy hacking,
> t.
>
--
Realisant mon espoir, je me lance vers la gloire.
Christopher Hahn == xrz1138 at gmail.com