[Jax.PM] ~9M lines of data
greg at turnstep.com
Mon Oct 14 12:44:20 CDT 2002
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
#!/usr/bin/perl -- -*-fundamental-*-
Quick notes:
This line:
> if (/^CREATE\sTABLE\s\(/) {
I do not think it does what you think it does! It only matches when the
paren comes right after "CREATE TABLE ", with no table name in between.
Add in a regex for the table name itself:
if (/^CREATE\sTABLE[^(]+\(/) {
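To see the difference, here is a quick sketch. The sample line and the
table name "users" are made up by me, not taken from your dump:

```perl
use strict;
use warnings;

## Hypothetical sample line; the real dump may look different.
my $line = "CREATE TABLE users (\n";

## Original pattern: wants "(" right after "CREATE TABLE ".
my $old = ($line =~ /^CREATE\sTABLE\s\(/)    ? 1 : 0;

## Revised pattern: allows anything that is not "(" before the paren.
my $new = ($line =~ /^CREATE\sTABLE[^(]+\(/) ? 1 : 0;

print "old pattern matched: $old\n";
print "new pattern matched: $new\n";
```

With a table name on the line, only the second pattern should match.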
> $|++
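This probably does not do what you want: $| applies only to the currently
selected output handle, which is STDOUT unless you select() another one,
so NFILE stays buffered.
See the rewrite below for the (ok, a) way to do this.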
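A small sketch of that behavior, assuming an in-memory handle standing in
for NFILE. The helper name nfile_autoflush is my own invention, used only
to peek at the handle's setting:

```perl
use strict;
use warnings;

## In-memory output handle standing in for NFILE (illustration only).
my $buffer = '';
open(NFILE, '>', \$buffer) or die "Could not open in-memory handle: $!";

## Hypothetical helper: report the autoflush setting of NFILE.
sub nfile_autoflush {
    my $old   = select(NFILE);   ## make NFILE the selected handle
    my $state = $|;              ## read its autoflush flag
    select($old);                ## restore the previous handle
    return $state;
}

$|++;                                  ## affects STDOUT, the selected handle
my $before = nfile_autoflush();        ## NFILE is untouched

select((select(NFILE), $| = 1)[0]);    ## set autoflush on NFILE only
my $after = nfile_autoflush();

print "NFILE autoflush before: $before, after: $after\n";
close(NFILE) or die "Could not close in-memory handle: $!";
```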
> $ctr++; # Count lines read..
Not needed, as perl keeps track itself with the handy $. variable.
Just grab it *before* you close the file handle.
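Something along these lines, using three made-up lines read through an
in-memory handle rather than your real file:

```perl
use strict;
use warnings;

## Three made-up lines, read through an in-memory handle.
my $data = "one\ntwo\nthree\n";
open(RFILE, '<', \$data) or die "Could not open in-memory handle: $!";

while (<RFILE>) {
    ## no manual counter needed; perl updates $. on every read
}

my $lines = $.;    ## save it before close(), which resets $.
close(RFILE) or die "Could not close in-memory handle: $!";

print "Read $lines lines\n";
```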
> s/\n/ /; # newline to space...
Unnecessary: <RFILE> hands you one line at a time, so the only newline is
the one at the end, and the trailing whitespace strip below already
removes it, unless you had something else in mind?
> s/^\s+//; # compress leading whitespace...
Not necessary: just add that to the regex text. It is useless overhead to
make perl change a line that is then discarded.
> s/\s+$//; # compress trailing whitespace...
> next unless length; # anything to process?
This might be good for "empty" lines, but probably better to just let the if statements
below discard them. Or, do something like this:
next unless /\S/;
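A quick sketch of that test on a few made-up lines (the sample data is
mine, purely for illustration):

```perl
use strict;
use warnings;

## Made-up sample: two lines with content, two that are only whitespace.
my @lines = ("CREATE TABLE t (\n", "\n", "   \t \n", "  id int\n");

my @kept;
for (@lines) {
    next unless /\S/;    ## skip lines with no non-whitespace character
    push @kept, $_;
}

print scalar(@kept), " of ", scalar(@lines), " lines kept\n";
```

Note that nothing is modified here: the line is only inspected, so there
is no strip-then-discard work.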
The if (..) next; construct is perfectly valid (and I sometimes use it myself)
but the code may be more readable if you change those to if..elsif..else.
Upon further thought, the "print NFILE" section is repeated code, so
we can compress it with some judicious if elsif mangling. If those extra
newlines before a "CREATE TABLE" statement are important, however, you can
fall back to the double print method.
> if (/^\#\sDumping\sdata\sfor\stable\s\'/) {
Seems like overkill, since we know exactly what the line will be.
This also applies to the CREATE TABLE line above as well. No sense in
stressing the regex engine. Matter of fact, we can probably use the
far more efficient index() since we know the exact placement. If we can
avoid calling the regex engine at all, even better.
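Roughly like this. The sample lines are made up; the point is that
index() returns 0 only when the string sits at the very start of the
line, a positive number when it appears later, and -1 when it is absent:

```perl
use strict;
use warnings;

## Made-up sample lines; the third contains the text, but not at the start.
my @lines = (
    "CREATE TABLE users (\n",
    "# Dumping data for table 'users'\n",
    "  note varchar(20) default 'CREATE TABLE',\n",
);

my (@create, @dump);
for my $line (@lines) {
    ## 1 if the line starts with the marker, 0 otherwise
    push @create, (index($line, 'CREATE TABLE')   == 0 ? 1 : 0);
    push @dump,   (index($line, '# Dumping data') == 0 ? 1 : 0);
}

print "create: @create\n";
print "dump:   @dump\n";
```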
> $flag++;
Don't increment and decrement a boolean flag; set it explicitly to "1"
and "0" instead. This will save you a lot of pain someday when you
accidentally increment it twice, and the decremented value is still true! :)
$flag=1; $flag=0;
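A contrived sketch of the failure, assuming two "start" events arrive
before a single "stop" (the variable names are mine):

```perl
use strict;
use warnings;

## Counter style: two starts, one stop.
my $counter = 0;
$counter++;
$counter++;    ## accidental second increment
$counter--;    ## one decrement is no longer enough

## Explicit style: the same three events.
my $flag = 0;
$flag = 1;
$flag = 1;     ## setting it twice is harmless
$flag = 0;

print "counter is ", ($counter ? "still true" : "false"), "\n";
print "flag is ",    ($flag    ? "still true" : "false"), "\n";
```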
## My rewrite. It may not do what you want, as I am not exactly sure of the
## program requirements (e.g. newlines)
#!/usr/bin/perl
## Removing strict wins a small speed increase at startup, but not
## noticeable for a 9 million line file, so leave it in
use strict;
## Best to name the files for easy changes later.
## May even want to allow them as arguments to the program.
my $infile = "Schema.Raw";
my $outfile = "Schema.Cooked";
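## Three-argument open keeps odd characters in a file name from being
## read as part of the mode
open (RAWFILE, '<', $infile) or die "\nCould not open $infile: $!\n";
open (COOKEDFILE, '>', $outfile) or die "\nCould not open $outfile: $!\n";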
## This is probably not needed, but here is how it is done:
select((select(COOKEDFILE),$|=1)[0]);
## Matter of fact, it may be faster without it, since the more often
## we dump to the new file, the slower it will be (in theory)
my $flag=0;
while (<RAWFILE>) {
## A "skip empty lines" test is tempting here, but then we are using a regex
## on every line, so just let empty lines fall through the tests below
## Strings to start and stop the copying
if (index($_,'CREATE TABLE') ==0) { $flag=1; }
elsif (index($_,'# Dumping data') ==0) { $flag=0; }
## I really like the "do if $var" syntax, but this could also be a normal if
print COOKEDFILE $_ if $flag;
} # end while loop
print "\nProcessed $. lines...\n\n";
close (COOKEDFILE) or die "\nCould not close $outfile: $!\n";
close (RAWFILE) or die "\nCould not close $infile: $!\n";
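
If you want to try the loop without touching the real files, here is a
sketch that runs the same logic over made-up input through in-memory
handles. The sample schema is my invention, so your real dump may well
behave differently:

```perl
use strict;
use warnings;

## Made-up input; the real Schema.Raw may differ.
my $raw = join '',
    "# Host: localhost\n",
    "CREATE TABLE users (\n",
    "  id int\n",
    ");\n",
    "# Dumping data for table 'users'\n",
    "INSERT INTO users VALUES (1);\n",
    "CREATE TABLE orders (\n",
    ");\n";

my $cooked = '';
open(RAWFILE, '<', \$raw)       or die "Could not open input: $!";
open(COOKEDFILE, '>', \$cooked) or die "Could not open output: $!";

my $flag = 0;
while (<RAWFILE>) {
    ## Same start and stop strings as the rewrite above
    if    (index($_, 'CREATE TABLE')   == 0) { $flag = 1; }
    elsif (index($_, '# Dumping data') == 0) { $flag = 0; }
    print COOKEDFILE $_ if $flag;
}

close(COOKEDFILE) or die "Could not close output: $!";
close(RAWFILE)    or die "Could not close input: $!";

print $cooked;
```

Only the two CREATE TABLE blocks should come out the other end.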
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200210141335
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
Comment: http://www.turnstep.com/pgp.html
iD8DBQE9qwEUvJuQZxSWSsgRAuDZAKC7Fx1ev94qkEneuLRNvj9Rs1AirACfeKef
+ltAcY0HciqzNzM1D7BuPxE=
=T6uR
-----END PGP SIGNATURE-----