[Jax.PM] ~9M lines of data

greg at turnstep.com greg at turnstep.com
Mon Oct 14 12:44:20 CDT 2002


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

#!/usr/bin/perl -- -*-fundamental-*-


Quick notes:


This line:

> if (/^CREATE\sTABLE\s\(/) {

I do not think it does what you think it does!

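The pattern wants the opening parenthesis right after "CREATE TABLE ", but the 
table name sits in between, so it will never match. Add in a regex for the 
table name itself: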
if (/^CREATE\sTABLE[^(]+\(/) {
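
To make the difference concrete, here is a minimal sketch comparing the two 
patterns. The sample line and the table name "users" are made up for the 
example; your real dump lines may look different.

```perl
use strict;
use warnings;

# Hypothetical sample line; the table name "users" is invented for illustration.
my $line = "CREATE TABLE users (\n";

# Original pattern: wants "(" right after "CREATE TABLE" plus one whitespace char.
my $old = ($line =~ /^CREATE\sTABLE\s\(/)    ? 1 : 0;

# Adjusted pattern: first skips everything that is not "(" (the table name).
my $new = ($line =~ /^CREATE\sTABLE[^(]+\(/) ? 1 : 0;

print "old pattern matched: $old\n";
print "new pattern matched: $new\n";
```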


> $|++

This probably does not do what you want, since it only affects the currently 
selected output handle (STDOUT by default), not the file you are writing to.
See the rewrite below for the (ok, a) way to do this.


> $ctr++;    # Count lines read..

Not needed, as perl keeps track itself with the handy $. variable.
Just grab it *before* you close the file handle.
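
A small sketch of that, assuming three made-up lines read through an 
in-memory filehandle so that no file on disk is needed. The variable names 
are only for illustration.

```perl
use strict;
use warnings;

# Three made-up lines, read through an in-memory filehandle.
my $data = "one\ntwo\nthree\n";
open(my $fh, '<', \$data) or die "Could not open in-memory data: $!\n";

while (<$fh>) {
    # Nothing to do here; perl updates $. on every read.
}

# Save the count while the handle is still open.
my $count = $.;
close($fh) or die "Could not close: $!\n";

print "Processed $count lines\n";
```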


> s/\n/ /;   # newline to space...

Unnecessary, as <RFILE> is already splitting on newlines, unless you 
had something else in mind?


> s/^\s+//;  # compress leading whitespace...

Not necessary: just add that to the regex text. Useless overhead to make perl change 
something and then discard it.


> s/\s+$//;  # compress trailing whitespace...
> next unless length; # anything to process?

This might be good for "empty" lines, but probably better to just let the if statements 
below discard them. Or, do something like this:
next unless /\S/;
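
A quick sketch of how that test behaves, using a handful of invented lines 
(two of which hold nothing but whitespace):

```perl
use strict;
use warnings;

# Made-up input: the second and third entries contain only whitespace.
my @lines = ("first\n", "\n", "   \t\n", "last\n");

my @kept;
for (@lines) {
    next unless /\S/;   # skip lines without any non-whitespace character
    push @kept, $_;
}

print scalar(@kept), " lines kept\n";
```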


The if (..) next; construct is perfectly valid (and I sometimes use it myself),
but the code may be more readable if you change those to if..elsif..else.
Upon further thought, the "print NFILE" section is repeated code, so 
we can compress it with some judicious if..elsif mangling. If those extra 
newlines before a "CREATE TABLE" statement are important, however, you can 
fall back to the double print method.


> if (/^\#\sDumping\sdata\sfor\stable\s\'/) {

Seems like overkill, since we know exactly what the line will be.
This also applies to the CREATE TABLE line above as well. No sense in 
stressing the regex engine. Matter of fact, we can probably use the 
far more efficient index() since we know the exact placement. If we can 
avoid calling the regex engine at all, even better.
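
Roughly, the index() test looks like this. The sample lines are invented for 
the sketch; the point is only that a return value of 0 means "the string 
starts here".

```perl
use strict;
use warnings;

# Made-up sample lines; only the leading text matters here.
my @lines = (
    "CREATE TABLE users (\n",
    "  id int,\n",
    "# Dumping data for table 'users'\n",
);

# index() returns the position of the first occurrence, or -1 if absent,
# so a result of 0 means the line begins with the string.
my @starts = map { index($_, 'CREATE TABLE') == 0 ? 1 : 0 } @lines;

print "@starts\n";
```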


> $flag++;

Don't increment and decrement a boolean flag; set it explicitly to "1" and 
"0" instead. This will save you a lot of pain someday when you accidentally 
increment it twice, and the decremented value is still true! :)
$flag=1; $flag=0;
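
Here is a tiny sketch of that failure mode, with an accidental second 
increment written in on purpose. The variable names are made up for the 
example.

```perl
use strict;
use warnings;

# Counter style: two increments followed by a single decrement.
my $counter = 0;
$counter++;
$counter++;   # the accidental second increment
$counter--;
my $counter_still_true = $counter ? 1 : 0;

# Explicit style: no matter how often it is set, clearing it once clears it.
my $flag = 0;
$flag = 1;
$flag = 1;
$flag = 0;
my $flag_still_true = $flag ? 1 : 0;

print "counter: $counter_still_true, flag: $flag_still_true\n";
```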



## My rewrite. It may not do what you want, as I am not exactly sure of the 
## program requirements (e.g. newlines)




#!/usr/bin/perl

## Removing strict wins a small speed increase at startup, but not 
## noticeable for a 9 million line file, so leave it in

use strict; 


## Best to name the files for easy changes later.
## May even want to allow them as arguments to the program.

my $infile  = "Schema.Raw";
my $outfile = "Schema.Cooked";

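## The three-argument form of open keeps odd characters in a file name
## from being read as part of the open mode.

open (RAWFILE,    '<', $infile)  or die "\nCould not open $infile: $!\n";
open (COOKEDFILE, '>', $outfile) or die "\nCould not open $outfile: $!\n";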

## This is probably not needed, but here is how it is done:

select((select(COOKEDFILE),$|=1)[0]);

## Matter of fact, it may be faster without it, since the more often 
## we dump to the new file, the slower it will be (in theory)


my $flag=0;

while (<RAWFILE>) { 

  ## A "skip empty lines" test is tempting here, but it would mean running 
  ## a regex on every line, which is what we are trying to avoid

  ## Strings to start and stop the copying
  if    (index($_,'CREATE TABLE')   ==0) { $flag=1; }
  elsif (index($_,'# Dumping data') ==0) { $flag=0; }

  ## I really like the "do if $var" syntax, but this could also be a normal if

  print COOKEDFILE $_ if $flag;

} # end while loop 

print "\nProcessed $. lines...\n\n";

close (COOKEDFILE) or die "\nCould not close $outfile: $!\n";
close (RAWFILE)    or die "\nCould not close $infile: $!\n";


Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200210141335

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
Comment: http://www.turnstep.com/pgp.html

iD8DBQE9qwEUvJuQZxSWSsgRAuDZAKC7Fx1ev94qkEneuLRNvj9Rs1AirACfeKef
+ltAcY0HciqzNzM1D7BuPxE=
=T6uR
-----END PGP SIGNATURE-----




