<br>Thank you both for taking the time!<br><br>A smaller test dump is divide-and-conquer... I should've thought of<br>this, but I was busy trying to slam out a script in half a day.<br><br>I will ponder the techniques you suggested, Tony.<br>
<br>Thank you again. A good word goes a long way.<br><br>Chris<br><br><div class="gmail_quote">On Thu, Nov 11, 2010 at 1:26 AM, Anthony Foiani <span dir="ltr"><<a href="mailto:tkil@scrye.com">tkil@scrye.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><br>
Christopher --<br>
<div class="im"><br>
Christopher Hahn <<a href="mailto:xrz1138@gmail.com">xrz1138@gmail.com</a>> writes:<br>
> I am trying to parse a huge (7 Gb) file that is line oriented but<br>
> has large sections that are any kind of binary character.<br>
</div>> [...]<br>
<div class="im">> I am sure that there are odd chars in the file that are doing this....<br>
<br>
</div>That's really unlikely. What makes you sure?<br>
<br>
You could try doing a simple replace of all non-printable chars with,<br>
say "!", and see if it still chokes.<br>
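<br>
A minimal sketch of that replacement, assuming "non-printable" means any<br>
byte outside printable ASCII other than tab, LF and CR. The names<br>
mask_nonprintable and mask_stream are made up for illustration, and the<br>
64 KiB chunk size is arbitrary:<br>
<br>
```perl
use strict;
use warnings;

# Replace every byte that is not printable ASCII, tab, LF or CR with "!".
sub mask_nonprintable {
    my ($data) = @_;
    # /c complements the search list; the lone "!" is reused for every match.
    ( my $copy = $data ) =~ tr/\x09\x0A\x0D\x20-\x7E/!/c;
    return $copy;
}

# Copy $in to $out in fixed-size chunks so the whole file is never in memory.
sub mask_stream {
    my ( $in, $out, $chunk_size ) = @_;
    binmode $in;
    binmode $out;
    my $buf = '';
    while (1) {
        my $n = read $in, $buf, $chunk_size;
        die "read failed: $!" unless defined $n;
        last if $n == 0;
        print {$out} mask_nonprintable($buf);
    }
    return;
}

# Only runs when given exactly two arguments: source file and destination file.
if ( @ARGV == 2 ) {
    my ( $src, $dst ) = @ARGV;
    open my $in,  '<', $src or die "Cannot open $src: $!";
    open my $out, '>', $dst or die "Cannot open $dst: $!";
    mask_stream( $in, $out, 1 << 16 );
    close $out or die "Cannot close $dst: $!";
}
```
<br>
If the parser still chokes on the masked copy, the odd characters are<br>
probably not the cause.<br>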
<div class="im"><br>
> I tried setting binmode on the input file handle, and just loading<br>
> the entire file into a buffer, just as a test, as we have enough<br>
> memory to do this.<br>
<br>
</div>And for 7GB, you really want to use 'mmap', not "read the buffer in".<br>
There are Perl modules to do that, although I've never used them.<br>
<br>
Oooooh!<br>
<br>
"Note that PerlIO now defines a :mmap tag and presents mmap'd files<br>
as regular files, if that is your cup of joe."<br>
-- <a href="http://search.cpan.org/%7Etoddr/Sys-Mmap-0.14/Mmap.pm" target="_blank">http://search.cpan.org/~toddr/Sys-Mmap-0.14/Mmap.pm</a><br>
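<br>
A minimal sketch of what using that layer might look like, untested on a<br>
7 GB file. It assumes a perl built with PerlIO and a platform that<br>
supports mmap; count_lines_mmap is a made-up name for illustration:<br>
<br>
```perl
use strict;
use warnings;

# Count newline-terminated records, reading through the PerlIO :mmap layer.
sub count_lines_mmap {
    my ($path) = @_;
    # The :mmap layer maps the file and uses the mapping as the read buffer.
    open my $fh, '<:mmap', $path or die "Cannot open $path: $!";
    my $count = 0;
    $count++ while <$fh>;
    close $fh;
    return $count;
}

# Only runs when a file name is given on the command line.
if ( @ARGV == 1 ) {
    print count_lines_mmap( $ARGV[0] ), "\n";
}
```
<br>
The rest of the script can treat the handle like any other read-only file<br>
handle; only the open mode changes.<br>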
<div class="im"><br>
> I watched using "top" and after the memory used climbed to a tad<br>
> more than the size of the file, the "Killed" message appeared and<br>
> the console closed itself.<br>
<br>
</div>Just to check, did you turn off the relevant limits? "ulimit -a" lists the current ones.<br>
<br>
What kind of system is this? If it is a non-ancient Linux system,<br>
there is the "Out Of Memory Killer" ("OOM Killer") to contend with.<br>
Looking through the logs should point fingers if that's the case.<br>
<br>
That's pretty much the only thing I can think about that would kill<br>
your console session, actually; normally perl itself would just give<br>
up and/or choke, and return control to your tty.<br>
<div class="im"><br>
> I have to stay at work until this is done, and so am just hoping that<br>
> someone is online and can give me the kick in the head that I need.<br>
<br>
</div>You can also use "strace" on Linux (and "truss" on Solaris and BSD?);<br>
that might tell you which system call actually failed. You probably<br>
want to send output to a file, since having it sent to a terminal that<br>
is disconnected is not horribly useful.<br>
<br>
strace -o boom.txt -f -tt ./my-perl-script-here.plx ...<br>
<br>
Finally, another way to deal with arbitrarily large but still<br>
"line-oriented" strings is to read in chunks (say, 4096 bytes), look<br>
for any complete structures, and then keep the "tail" from the<br>
previous blocks, append the next 4096 bytes, etc:<br>
<br>
# what separates records?<br>
my $sep_re = qr/xyzzy/;<br>
<br>
# how much to read at a time?<br>
my $chunk_size = 1 << 16;<br>
<br>
# read in chunks at a time, split into records, continue.<br>
# start with an empty (defined) buffer so "length $buf" is 0 on the first read<br>
my $buf = '';<br>
while ( my $n_read = read IN, $buf, $chunk_size, length $buf )<br>
{<br>
# split into records, preserving any leading/trailing nulls<br>
my @recs = split $sep_re, $buf, -1;<br>
<br>
# store any trailer into the buffer for next read<br>
$buf = pop @recs;<br>
<br>
# now process the records that have been found<br>
foreach my $rec ( @recs )<br>
{<br>
transmogrify $rec;<br>
}<br>
}<br>
<br>
# handle end-of-file stragglers<br>
if ( length( $buf ) )<br>
{<br>
transmogrify $buf;<br>
}<br>
<br>
I've got a bit of an obscure example that does this; take a look<br>
around line 547 of perl-chat:<br>
<br>
<a href="http://foiani.com/perl/examples/perl-chat" target="_blank">http://foiani.com/perl/examples/perl-chat</a><br>
<br>
# and anything we had left over from last time<br>
my $buf = $read_buf{$sock};<br>
<br>
my $n_bytes_read = sysread($sock, $buf, SOCK_IO_SIZE, length $buf);<br>
if (not defined $n_bytes_read)<br>
{<br>
print STDERR "sysread returned error: $!";<br>
$socket_info{$sock}{state} = SS_CLOSE_IMMED;<br>
}<br>
elsif ($n_bytes_read == 0)<br>
{<br>
# if we didn't read any more bytes, then we probably want to remove it.<br>
$socket_info{$sock}{state} = SS_CLOSE_PENDING;<br>
}<br>
else<br>
{<br>
# split the buffer on newlines...<br>
my @lines = split /\n/, $buf, -1;<br>
# ... saving any stragglers for the next time around<br>
$buf = pop @lines;<br>
<br>
# and then process each line.<br>
foreach (@lines)<br>
{<br>
handle_line($sock, $_);<br>
print STDERR "r" if $config{DEBUG};<br>
}<br>
}<br>
<br>
# store the remnants for the next time around.<br>
$read_buf{$sock} = $buf;<br>
<br>
I can expand on this technique if that [ancient!] code isn't<br>
particularly enlightening. :)<br>
<div class="im"><br>
> In any case, thanks for the attention,<br>
<br>
</div>Hope it helps. :)<br>
<br>
Happy hacking,<br>
t.<br>
</blockquote></div><br><br clear="all"><br>-- <br>Realisant mon espoir, je me lance vers la gloire.<br>Christopher Hahn == <a href="mailto:xrz1138@gmail.com">xrz1138@gmail.com</a><br>