[Jax.PM] ~9M lines of data
greg at turnstep.com
Tue Oct 15 10:19:14 CDT 2002
To quote J Proctor:
>> In fact I vote Greg to be the Jax.PM Leader - Yeah Greg!
> No way. You're not getting out of it *that* easy, Sneexie.
Not to mention the fact that (a) I have way too many projects already,
and (b) I'm not even in Jax anymore. :)
> Didn't realize index() was that much faster, though. Curious if you'd
> like to try a version each way and report back. Anchoring the regexes to
> the beginning of the line (i.e. /^CREATE TABLE/) should be fairly well
> optimized, and I *thought* (Greg, please correct me) that index() wasn't
> context-aware enough to say the goal is to match at the beginning of the
> line, if it doesn't, move on. So one regex per line versus (length $_ -
> length $target) flat comparisons doesn't seem like there'd be that much of
> an advantage.
Well, index() has no concept of context at all - it is simply a
very quick byte-by-byte search, similar in spirit to some of the C
"string" functions (see man index). The problem is that it will scan
the entire line looking for a match, while the regular expression can
be conveniently anchored. Using a regular expression carries a small bit
of overhead, however, and although it may *almost* be as fast as index()
and friends in the general case, it never shall be. There are plenty of
specific cases in which the regex will be faster, of course (an anchored
match against long lines that mostly fail is the obvious one), and 9
times out of 10, a regex is your best bet. For those times when you are
doing a lot of work and speed is an issue, I maintain that index() is
better overall.
Having said that, I'm almost willing to recant my earlier code, and go
for the simple anchored regex if for no other reason than to make the
code a bit more readable. I've always had a soft spot for some of the
underused functions like index() in perl. Be glad I didn't find a way to
throw in vec() and prototype() :)
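For anyone who wants to try a version each way, here is a rough sketch using
the core Benchmark module. The sample lines, the sub names, and the iteration
count are made up for illustration and are not from the original script; real
numbers will depend on your data and your perl, so treat this as a starting
point rather than a verdict.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: a few lines shaped like a SQL dump,
# repeated so that each run has a measurable amount of work to do.
my @lines = (
    "CREATE TABLE foo (id integer);\n",
    "INSERT INTO foo VALUES (1);\n",
    "-- a comment that mentions CREATE TABLE late in the line\n",
) x 1000;

my $target = 'CREATE TABLE';

# Anchored regex: the match can fail as soon as the start of the line differs.
sub count_regex {
    my $n = 0;
    for (@lines) { $n++ if /^CREATE TABLE/ }
    return $n;
}

# index(): searches the whole line, so only a return value of 0
# means the target sits at the very beginning.
sub count_index {
    my $n = 0;
    for (@lines) { $n++ if index($_, $target) == 0 }
    return $n;
}

# Run each sub a fixed number of times and print a comparison table.
cmpthese(500, {
    anchored_regex => \&count_regex,
    index_eq_zero  => \&count_index,
});
```

Note that both subs count the same lines: the comment line contains the
target, but not at position 0, so neither version counts it.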
Aaron Johnson opines:
> Have you considered the inverse? Looking at what you don't want vs.
> what you want.
I like this idea the best. When faced with a large file like this,
I tend to crop things out one at a time until I know exactly what is
in the file, so no surprises crop up later. I think the little code
snippet Aaron posted is better than my solution.
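Aaron's snippet is not reproduced in this message, so what follows is only my
own minimal sketch of the "crop out what you don't want" idea. The patterns in
@unwanted, the is_unwanted() name, and the sample data are assumptions for
illustration; in practice you would add one pattern at a time and look at
what is left over.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical patterns for lines already known to be uninteresting.
# Add to this list as you learn more about what is in the file.
my @unwanted = (
    qr/^INSERT INTO/,   # bulk data rows
    qr/^--/,            # SQL comments
    qr/^\s*$/,          # blank lines
);

# Return true if the line matches any of the unwanted patterns.
sub is_unwanted {
    my ($line) = @_;
    for my $re (@unwanted) {
        return 1 if $line =~ $re;
    }
    return 0;
}

# Stand-in for the real file: an in-memory filehandle over sample data.
my $dump = join '',
    "CREATE TABLE foo (id integer);\n",
    "INSERT INTO foo VALUES (1);\n",
    "\n",
    "-- end of dump\n";

open my $fh, '<', \$dump or die "Cannot open in-memory file: $!";

# Read one line at a time so a multi-million-line file never sits in memory.
while (my $line = <$fh>) {
    next if is_unwanted($line);
    print $line;    # whatever survives is what still needs explaining
}
close $fh;
```

For the real thing, open the actual file instead of \$dump; the loop stays
the same.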
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200210151124