[Jax.PM] ~9M lines of data

greg at turnstep.com greg at turnstep.com
Tue Oct 15 10:19:14 CDT 2002


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


To quote J Proctor:

>> In fact I vote Greg to be the Jax.PM Leader - Yeah Greg!

> No way.  You're not getting out of it *that* easy, Sneexie.

Not to mention the fact that (a) I have way too many projects already, 
and (b) I'm not even in Jax anymore. :)

> Didn't realize index() was that much faster, though.  Curious if you'd
> like to try a version each way and report back.  Anchoring the regexes to
> the beginning of the line (i.e. /^CREATE TABLE/) should be fairly well
> optimized, and I *thought* (Greg, please correct me) that index() wasn't
> context-aware enough to say the goal is to match at the beginning of the
> line, if it doesn't, move on.  So one regex per line versus (length $_ -
> length $target) flat comparisons doesn't seem like there'd be that much of
> an advantage.

Well, index() has no concept of being "context-aware" - it is simply a 
very quick byte-by-byte search, similar in spirit to some of the C 
"string" functions (see man index). The catch is that it will scan 
the entire line looking for a match, while the regular expression can 
be conveniently anchored. A regular expression carries a small bit of 
overhead, however, and although it may get *almost* as fast as index() 
and friends for a plain substring search, it never quite shall be. There 
are plenty of specific cases in which the regex will be faster, of 
course, and 9 times out of 10 a regex is your best bet. For those times 
when you are doing a lot of work and speed is an issue, I maintain that 
index() is better overall.
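
For anyone who wants to try it each way, here is a minimal sketch of the 
variants under discussion. The sample lines and variable names are made 
up for illustration (the real file is not shown in this thread), and the 
rindex() variant is an extra idiom, not something from the earlier code.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the real ~9M line file.
my @lines = (
    "CREATE TABLE foo (id int);",
    "INSERT INTO foo VALUES (1);",
    "-- see CREATE TABLE above",
);
my $target = "CREATE TABLE";

# Anchored regex: only a match at the very start of the line counts.
my @by_regex = grep { /^CREATE TABLE/ } @lines;

# index() returns the offset of the first match, or -1 if there is none.
# Comparing with 0 gives the same answer as the anchored regex, but a
# line that does not contain the target is still searched to the end.
my @by_index = grep { index($_, $target) == 0 } @lines;

# rindex() with a start position of 0 can only ever match at offset 0,
# so it behaves like an anchored plain-string test.
my @by_rindex = grep { rindex($_, $target, 0) == 0 } @lines;

print scalar(@by_regex), " ", scalar(@by_index), " ",
      scalar(@by_rindex), "\n";
```

All three pick out the same lines; only the amount of scanning differs. 
To see which is actually faster on real data, the core Benchmark module 
(cmpthese) can time each version.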

Having said that, I'm almost willing to recant my earlier code, and go 
for the simple anchored regex if for no other reason than to make the 
code a bit more readable. I've always had a soft spot for some of the 
underused functions like index() in perl. Be glad I didn't find a way to 
throw in vec() and prototype() :)

Aaron Johnson opines:

> Have you considered the inverse?  Looking at what you don't want vs.
> what you want.

I like this idea the best. When faced with a large file like this, 
I tend to crop things out one at a time until I know exactly what is 
in the file, so no surprises crop up later. I think the little code 
snippet Aaron posted is better than my solution.
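
A rough sketch of that cropping-out approach, for the record. This is 
not Aaron's snippet; the patterns, the leftovers() helper, and the 
sample lines are all hypothetical. The idea is to list the line types 
you already understand and print whatever is left.

```perl
use strict;
use warnings;

# Hypothetical patterns for line types that are already accounted for.
my @known = (
    qr/^CREATE TABLE/,
    qr/^INSERT INTO/,
    qr/^\s*$/,      # blank lines
    qr/^--/,        # comment lines
);

# Return the lines that none of the known patterns match.
sub leftovers {
    my @rest;
    LINE: for my $line (@_) {
        for my $re (@known) {
            next LINE if $line =~ $re;
        }
        push @rest, $line;
    }
    return @rest;
}

# Made-up sample input; in practice the lines would come from the file.
my @sample = (
    "CREATE TABLE foo (id int);",
    "INSERT INTO foo VALUES (1);",
    "",
    "-- a comment",
    "COPY foo FROM stdin;",
);
print "$_\n" for leftovers(@sample);
```

Each time something unexpected shows up in the output, add a pattern for 
it and run again, until nothing is left that you cannot explain.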

Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200210151124

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)
Comment: http://www.turnstep.com/pgp.html

iD8DBQE9rDOVvJuQZxSWSsgRAu0+AKDfPVvWMHD4wbFmufeeaxbNjyMkSQCgrcXY
a0PLO4KOQuXhgjEvF+wnqWk=
=3l8D
-----END PGP SIGNATURE-----





