[pm-h] relocate metatags in hyperlatex-generated HTML

John Lightsey john at nixnuts.net
Fri Feb 9 16:41:36 PST 2007


On Fri, 2007-02-09 at 11:42 -0600, Russell L. Harris wrote:

...

> In the Llama book, in a footnote in chapter 9 ("Processing Text with
> Regular Expressions") is the following warning:  "...you can't
> correctly parse HTML with simple regular expressions.  If you need to
> work with HTML or a similar markup language, use a module that's made
> to handle the complexities."  What am I to make of this warning?

I don't have a copy of the book, so I don't know the exact context, but
I assume he's referring to the fact that HTML is like Perl in the sense
that you can write the same thing in a variety of ways that all behave
identically.

<b><a href="...">text</a></b>
<a href="..."><b>text</b></a>
<a name="mylink" href="..."><b>text</b></a>
<b><a hreflang="en" href="...">text</a></b>
<A CLASS="bold" HREF="...">text</A>
<A style="font-weight: bold" href="...">text</a>

All of those do the same basic thing, and so does this:

<a
href
=
...
>
<b>text</b>
</a
>

So using a regular expression to parse HTML is just brittle.  You can't
possibly account for all of the legal variations of HTML syntax.  The
more robust alternative is to use something like HTML::TokeParser or
HTML::TreeBuilder to do most of the work.
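
As a rough sketch of the difference, here is a naive regex next to
HTML::TokeParser run over the six one-line variations above.  The
example.com URLs are made up to stand in for the "..." in those
examples, and the naive pattern is just one plausible first attempt,
not anything from the book.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical URLs standing in for the "..." in the examples above.
my $html = <<'END';
<b><a href="http://example.com/1">text</a></b>
<a href="http://example.com/2"><b>text</b></a>
<a name="mylink" href="http://example.com/3"><b>text</b></a>
<b><a hreflang="en" href="http://example.com/4">text</a></b>
<A CLASS="bold" HREF="http://example.com/5">text</A>
<A style="font-weight: bold" href="http://example.com/6">text</a>
END

# Naive approach: only matches a lowercase <a> tag whose first
# attribute is href.
my @naive = $html =~ /<a href="([^"]+)">/g;

# Parser approach: get_tag("a") returns each <a> start tag as an array
# ref; the attributes are a hash ref at index 1, with lowercased names.
my @parsed;
my $p = HTML::TokeParser->new(\$html) or die "cannot parse document";
while (my $tag = $p->get_tag("a")) {
    my $href = $tag->[1]{href};
    next unless defined $href;
    # Collect the text up to the closing </a>, skipping nested tags.
    my $text = $p->get_trimmed_text("/a");
    push @parsed, [$href, $text];
}

printf "regex found %d links, parser found %d\n",
    scalar @naive, scalar @parsed;
print "$_->[0]\t$_->[1]\n" for @parsed;
```

The regex could be extended to cope with each variation in turn, but
every extension makes it harder to read, and there is always another
legal variation it misses.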

OTOH, if you're just grabbing two misplaced <meta name="whatever" ...>
tags from the body and inserting them before </head>, you can put the
entire document in one scalar and make the change with a one-line
substitution.

$html =~ s{(</head>.*)(<meta\s[^>]+>)(.*)(<meta\s[^>]+>)}{$2$4$1$3}is;
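
Here is a minimal, self-contained sketch of that substitution in use.
The sample document and its meta tag contents are invented for
illustration; it assumes exactly two <meta> tags sit after </head>, and
that neither contains a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical input: two <meta> tags that ended up in the body.
my $html = <<'END';
<html>
<head>
<title>Example</title>
</head>
<body>
<meta name="keywords" content="perl, html">
<p>Some text.</p>
<meta name="description" content="An example page">
</body>
</html>
END

# s{}{} delimiters avoid having to escape the slash in </head>.
# /i ignores case; /s lets . match newlines so the match can span lines.
# $1 = </head> up to the first meta tag, $2 = first meta tag,
# $3 = everything between the two, $4 = second meta tag.
$html =~ s{(</head>.*)(<meta\s[^>]+>)(.*)(<meta\s[^>]+>)}{$2$4$1$3}is;

print $html;
```

Both tags end up immediately before </head> on the same line, with no
newline between them, so add "\n" in the replacement if you care about
the layout of the output.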

Using one of the more robust methods of processing HTML is likely
overkill.


John


