[pm-h] relocate metatags in hyperlatex-generated HTML

G. Wade Johnson gwadej at anomaly.org
Fri Feb 9 21:21:43 PST 2007


On Fri, 9 Feb 2007 11:42:45 -0600
"Russell L. Harris" <rlharris at oplink.net> wrote:

> This message begins a new thread titled 
> 
>     "relocate metatags in hyperlatex-generated HTML"
> 
> which continues the thread 
> 
>     "perl application: I'm in over my head".
> 
> The problem is how to relocate metatags in hyperlatex-generated HTML,
> moving the tags from the body of the file to the head of the file.
> 
> I've just finished reading the 4th edition of the O'Reilly Llama book,
> "Learning Perl"; and I'm still "in over my head".

Sounds like a good start.

[snip]


> OK; I have this framework running.  A new file is being generated and
> the old file is saved with the ".bak" extension.  
> 
> But I haven't figured out how to use regular expression matching to
> obtain the offsets and lengths needed by "splice".
> 
> For the "processing", it appears to me that I must:
> 
>     (1) find within the array @lines the offset of the line following
>     the </title> tag; this is the insertion point
> 
>     (2) find within the array @lines the offset and the length of the
>     "keywords" metatag, which is the second of the two tags
> 
>     (3) call splice to remove the "keywords" metatag from the array
>     @lines
> 
>     (4) insert the "keywords" metatag at the insertion point in the
>     array @lines
> 
>     (5) find within the array @lines the offset and the length
>     of the "description" metatag, which is the first of the two tags
> 
>     (6) call splice to remove the "description" metatag from the array
>     @lines
> 
>     (7) insert the "description" metatag at the insertion point in the
>     array @lines

This is a good description of what you need to do.

> By moving the second tag before moving the first, the offset of the
> insertion point does not change.  

Not a bad approach, although there's a slightly easier way.

> In the Llama book, in a footnote in chapter 9 ("Processing Text with
> Regular Expressions") is the following warning:  "...you can't
> correctly parse HTML with simple regular expressions.  If you need to
> work with HTML or a similar markup language, use a module that's made
> to handle the complexities."  What am I to make of this warning?

The "right" way to solve the problem would be to use an HTML parser.
But, in this case it would be overkill and would require you to learn a
lot more before you could get started (as John has already pointed out).

In this particular case, you have the code that is generating the
metatags; you are not working with HTML from the wild. If you were
looking at the general case of extracting something from HTML, you
would need the big guns. You may still want to go that route later,
when you are more comfortable with the language and the problem.

Now, let's get back to the quick and dirty solution.

Let's start with some assumptions; let me know if any of them fails.

1. The </title> end tag is not broken across a line boundary (pretty
safe).

2. The 'keywords' metatag is all on one line.

3. There is nothing else on the line with the 'keywords' metatag.

4. The 'description' metatag is all on one line.

5. There is nothing else on the line with the 'description' metatag.

If any of the above is not true, we will need to get a little more
complicated.

So let's work out the simple case.

The core of the processing is a loop over the lines by index. Most of
the time in Perl it is better to loop over an array with a foreach loop
over the elements themselves, but in this case we want to save off the
indexes of the elements we find interesting.

If you have a C, C++, Java, or JavaScript background, you should
recognize the loop:

my $lines_count = @lines;
for(my $index = 0;$index < $lines_count;++$index)
{
    #Test lines here
}

If the assumptions given above hold, the tests are relatively
straight-forward. They would look something like:

  if($lines[$index] =~ m{</title>})
  {
      # save $index somewhere for later.
  }

The regex match can use delimiters other than / if you supply the
optional 'm'. This helps especially when parsing HTML/XML-looking text,
because otherwise you have to escape every '/' that's in the real
expression.
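
To make the difference concrete, here is a small sketch that matches
the same end tag both ways. The sample line is made up for
illustration; it is not taken from real hyperlatex output.

```perl
use strict;
use warnings;

# Hypothetical sample line, for illustration only.
my $line = "<title>My Page</title>\n";

# With the default / delimiter, the slash in the tag must be escaped.
my $with_slashes = $line =~ /<\/title>/ ? 1 : 0;

# With m{...}, the slash needs no escaping.
my $with_braces = $line =~ m{</title>} ? 1 : 0;

print "slashes: $with_slashes, braces: $with_braces\n";
```

Both forms match the same text; the second is just easier to read.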

Repeat as needed for the other pieces.
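
Put together, the loop with all three tests might look roughly like
the sketch below. The sample @lines, the exact spelling of the
metatags (name="description" and name="keywords"), and the variable
names $title_index, $description_index, and $keywords_index are my
assumptions; adjust the patterns to whatever hyperlatex really emits.

```perl
use strict;
use warnings;

# Hypothetical sample input; real hyperlatex output may differ.
my @lines = (
    "<html><head>\n",
    "<title>Example</title>\n",
    "</head><body>\n",
    "<meta name=\"description\" content=\"An example page\">\n",
    "<meta name=\"keywords\" content=\"perl, html\">\n",
    "<p>Text</p>\n",
    "</body></html>\n",
);

# Indexes of the interesting lines; undef until found.
my ($title_index, $description_index, $keywords_index);

my $lines_count = @lines;
for (my $index = 0; $index < $lines_count; ++$index)
{
    if ($lines[$index] =~ m{</title>}i)
    {
        $title_index = $index;
    }
    elsif ($lines[$index] =~ m{<meta\s+name="description"}i)
    {
        $description_index = $index;
    }
    elsif ($lines[$index] =~ m{<meta\s+name="keywords"}i)
    {
        $keywords_index = $index;
    }
}

print "title: $title_index, description: $description_index, ",
      "keywords: $keywords_index\n";
```

In real code you would want to check that all three indexes are
defined before going on, since a file might lack one of the tags.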

Then use splice to remove the ones you want to move and insert after
the title index.
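
As a rough sketch of that last step: the sample @lines and the three
index values below are made up, standing in for whatever a search
loop found. It also assumes both metatags come after the title line
and that 'keywords' comes after 'description', as in your
description of the file.

```perl
use strict;
use warnings;

# Hypothetical sample input; real hyperlatex output may differ.
my @lines = (
    "<html><head>\n",
    "<title>Example</title>\n",
    "</head><body>\n",
    "<meta name=\"description\" content=\"An example page\">\n",
    "<meta name=\"keywords\" content=\"perl, html\">\n",
    "<p>Text</p>\n",
    "</body></html>\n",
);

# Assumed results of the search loop for this sample.
my ($title_index, $description_index, $keywords_index) = (1, 3, 4);

# Remove the later line first so the earlier index stays valid.
my ($keywords_line)    = splice(@lines, $keywords_index, 1);
my ($description_line) = splice(@lines, $description_index, 1);

# Insert both lines, in their original order, just after the title
# line. A length of 0 means nothing is removed at that position.
splice(@lines, $title_index + 1, 0, $description_line, $keywords_line);

print @lines;
```

Because both tags are removed before anything is inserted, and both
sit after the title line, the title index does not shift, and one
call to splice can insert the two lines together.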

G. Wade
-- 
One OS to rule them all, One OS to find them,
One OS to bring them all and in the darkness bind them,
In the land of Redmond, where the Windows lie.

