[pm-h] relocate metatags in hyperlatex-generated HTML

Russell L. Harris rlharris at oplink.net
Fri Feb 9 09:42:45 PST 2007


This message begins a new thread titled 

    "relocate metatags in hyperlatex-generated HTML"

which continues the thread 

    "perl application: I'm in over my head".

The problem is how to relocate metatags in hyperlatex-generated HTML,
moving the tags from the body of the file to the head of the file.

I've just finished reading the 4th edition of the O'Reilly Llama book,
"Learning Perl"; and I'm still "in over my head".



> On Wed, 7 Feb 2007 05:09:08 -0600
> "Russell L. Harris" <rlharris at oplink.net> wrote:
> 
>> (1) Search for the tag: 
>> 
>>     <meta name="description" ... >
>> 
>> (2) If the tag is found, move the tag from the body of the HTML file
>> to the head of the HTML file, inserting it immediately following the
>> line of the title tag: 
>> 
>>     <title> ... </title>
>> 
>> (3) Search for the tag: 
>> 
>>     <meta name="keywords" ... >
>> 
>> (4) If the tag is found, move the tag from the body of the HTML file
>> to the head of the HTML file, inserting it immediately following the
>> line of the title tag:
>> 
>>     <title> ... </title>



* G. Wade Johnson <gwadej at anomaly.org> [070207 09:43]:
>
> I would suggest reading the file into an array of lines. You can use
> regular expressions on each line to find the lines of interest.
> 
> Use the splice operator to remove the meta lines from the array.
>
> Use the splice operator to insert the meta lines after the title line.
> 
> Write out all lines to replace the old file.
> 
> By using this approach and the -i option, you can process all of the
> files in a directory with:
> 
>   perl -i.bak script.pl *.html



* G. Wade Johnson <gwadej at anomaly.org> [070207 21:56]:
> 
> ----------------------------------------------------
> #!/usr/bin/perl -i.bak
> 
> use strict;
> use warnings;
> 
> # slurp the whole file as a single string.
> undef $/;
> 
> while(<>)
> {
>     # split the file into a list of lines, losing the newline
>     # in the process
>     my @lines = split( /\r?\n/, $_ );
> 
>     # process the lines here
> 
>     # If you don't print these out, the new file will be empty.
>     print join( "\n", @lines );
> }
> -----------------------------------------------------
 
OK; I have this framework running.  A new file is being generated and
the old file is saved with the ".bak" extension.  

But I haven't figured out how to use regular expression matching to
obtain the offsets and lengths needed by "splice".
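
A minimal sketch of one way to get such an offset, assuming each tag
sits entirely on one line (so the length passed to "splice" would be 1).
The sample data and the helper name first_index_matching are
hypothetical, made up for illustration; they are not from the framework
quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines of one HTML file.
my @lines = (
    '<html><head>',
    '<title>Example</title>',
    '</head><body>',
    '<meta name="description" content="a sample page">',
    '<meta name="keywords" content="perl, html">',
    '</body></html>',
);

# Return the index of the first element matching $re,
# or undef if no element matches.
sub first_index_matching {
    my ($re, @list) = @_;
    for my $i (0 .. $#list) {
        return $i if $list[$i] =~ $re;
    }
    return;
}

my $title_idx    = first_index_matching( qr{</title>}i, @lines );
my $keywords_idx = first_index_matching( qr{<meta\s+name="keywords"}i, @lines );

print "title at $title_idx, keywords at $keywords_idx\n";
# prints: title at 1, keywords at 4
```

The insertion point would then be $title_idx + 1.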

For the "processing", it appears to me that I must:

    (1) find within the array @lines the offset of the line following
    the </title> tag; this is the insertion point

    (2) find within the array @lines the offset and the length of the
    "keywords" metatag, which is the second of the two tags

    (3) call splice to remove the "keywords" metatag from the array
    @lines

    (4) insert the "keywords" metatag at the insertion point in the
    array @lines

    (5) find within the array @lines the offset and the length
    of the "description" metatag, which is the first of the two tags

    (6) call splice to remove the "description" metatag from the array
    @lines

    (7) insert the "description" metatag at the insertion point in the
    array @lines

By moving the second tag before moving the first, the offset of the
insertion point does not change.  
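
The seven steps above might be sketched roughly as follows. This is only
a sketch under stated assumptions: each metatag occupies exactly one
line, each lies below the title line, and the closing title tag is on
the same line as the opening one. The sub name move_after_title and the
sample data are hypothetical.

```perl
use strict;
use warnings;

# Move the first line matching $tag_re to the position just after the
# first line containing </title>. Assumes the tag fits on one line and
# lies below the title line; otherwise the array is left unchanged.
sub move_after_title {
    my ($lines, $tag_re) = @_;

    my ($title_idx) = grep { $lines->[$_] =~ m{</title>}i } 0 .. $#$lines;
    my ($tag_idx)   = grep { $lines->[$_] =~ $tag_re }      0 .. $#$lines;

    return unless defined($title_idx) && defined($tag_idx);
    return unless $tag_idx > $title_idx;

    # Offset $tag_idx, length 1: remove the tag line and keep it.
    my ($tag) = splice( @$lines, $tag_idx, 1 );

    # Length 0: remove nothing, insert the tag at the insertion point.
    splice( @$lines, $title_idx + 1, 0, $tag );
    return;
}

# Hypothetical sample data standing in for the lines of one HTML file.
my @lines = (
    '<html><head>',
    '<title>Example</title>',
    '</head><body>',
    '<meta name="description" content="a sample page">',
    '<meta name="keywords" content="perl, html">',
    '</body></html>',
);

# Second tag first, so that "description" ends up above "keywords".
move_after_title( \@lines, qr{<meta\s+name="keywords"}i );
move_after_title( \@lines, qr{<meta\s+name="description"}i );

print "$_\n" for @lines;
```

In the framework quoted above, the two calls would presumably go where
the comment says "process the lines here". A tag that wraps across
several lines would not be handled by this sketch.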

In the Llama book, in a footnote in chapter 9 ("Processing Text with
Regular Expressions") is the following warning:  "...you can't
correctly parse HTML with simple regular expressions.  If you need to
work with HTML or a similar markup language, use a module that's made
to handle the complexities."  What am I to make of this warning?

RLH

