[boulder.pm] problem with HTML::LinkExtor and <applet>
Walter Pienciak
walter at frii.com
Thu Sep 14 15:00:01 CDT 2000
On Thu, 14 Sep 2000, rise wrote:
> Having poked around a bit I still have no real idea how to fix it, but it
> looks to me like the failure mode is that an applet tag isn't a legal URI
> and thus HTML::Parser isn't parsing it as a link at all. If that's the
> case then 'fixing' HTML::LinkExtor would probably mean breaking
> HTML::Parser's standards conformance. It's hackish, but have you looked
> at doing a two-pass link extraction: one with LinkExtor and another
> looking for applet tags? Since they're tags (even if they turn out to not
> be URIs) you can probably pull them out with HTML::Parser itself and
> reformat them as a link.
Hi, (Jonathan|Jon|John|rise),
Ukkk. I *really* want to avoid writing an <applet> handler for HTML::Parser.
(In fact, I dislike writing handlers for any of the *::Parser stuff:
my mind seems to be organized differently than those things. ;^)
Anyway, the context of my comments below is that I'm talking myself into
a belief that HTML::Parser has a bug.
Here's my thinking:
I *do* believe that the applet tag's attributes comprise a legal URI.
The format is that of a relative URI.
The base URI is derivable via the normative algorithm described in RFC 1808,
section 3 (and also in RFC 2396, I think). Granted, the <applet> tag has
some unique characteristics, but RFC 1808 states
    It is beyond the scope of this document to specify how, for each
    media type, the base URL can be embedded. It is assumed that user
    agents manipulating such media types will be able to obtain the
    appropriate syntax from that media type's specification.
and the HTML spec I checked (4.01, at http://www.w3.org/TR/html4/struct/objects.html#h-13.4) is very specific about the nature of the suspect attributes:
    codebase
        This attribute specifies the base URI for the applet. If this attribute
        is not specified, then it defaults the same base URI as for the current
        document. Values for this attribute may only refer to subdirectories of
        the directory containing the current document.
    code
        This attribute specifies either the name of the class file that contains
        the applet's compiled applet subclass or the path to get the class,
        including the class file itself. It is interpreted with respect to the
        applet's codebase. One of code or object must be present.
So all that spec-quoting means, to me, that there's no reason for HTML::Parser
to be mishandling <applet> URIs: attribute specs nail them down unambiguously.
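As a minimal sketch of that reading of the spec, assuming the CPAN URI module is installed: the codebase value is resolved against the document's base URI, and the code value is then resolved against the result. The document URL and the file names here are made up for illustration.

```perl
use strict;
use warnings;
use URI;

# Hypothetical values, for illustration only.
my $doc_base = 'http://www.example.com/demo/index.html';
my $codebase = 'classes/';      # <applet codebase="classes/" ...>
my $code     = 'Clock.class';   # <applet code="Clock.class" ...>

# Step 1: codebase is a relative URI, resolved against the document base.
my $applet_base = URI->new_abs($codebase, $doc_base);

# Step 2: code is interpreted with respect to the applet's codebase.
my $class_url = URI->new_abs($code, $applet_base);

print "$class_url\n";
```

Note the trailing slash on the codebase value: without it, ordinary relative-URI resolution would treat "classes" as a file name and drop it when resolving the code value.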
I like your suggestion about doing two passes (the second pass would need to
identify the <applet> stuff, re-create the bogus URLs, delete them from the
data structures where they'd been added, and then generate and insert the
*correct* URLs). I may go that route. I'm just lazy, that's all . . .
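For what it's worth, the applet-only pass might look roughly like the sketch below. It is an illustration under assumptions, not a tested fix: it uses the HTML::Parser version 3 handler interface plus the CPAN URI module, the helper name applet_class_urls is invented here, and appending a slash to a codebase that lacks one is a guess at what browsers do rather than something the spec text above says. Removing the bogus entries from the LinkExtor results is left out.

```perl
use strict;
use warnings;
use HTML::Parser;
use URI;

# Hypothetical helper: return the absolute class-file URL for each
# <applet> tag in $html, given the document's base URI.
sub applet_class_urls {
    my ($html, $doc_base) = @_;
    my @urls;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                return unless $tag eq 'applet' && defined $attr->{code};

                # Default to the document base when codebase is absent.
                my $base = $doc_base;
                if (defined $attr->{codebase}) {
                    my $cb = $attr->{codebase};
                    # Assumption: treat codebase as a directory.
                    $cb .= '/' unless $cb =~ m{/$};
                    $base = URI->new_abs($cb, $doc_base);
                }

                # code is resolved relative to the applet's codebase.
                push @urls, URI->new_abs($attr->{code}, $base)->as_string;
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;
    return @urls;
}

# Usage with made-up markup and a made-up document URL.
my $page = '<applet codebase="classes" code="Clock.class" width=100 height=100></applet>';
print "$_\n" for applet_class_urls($page, 'http://www.example.com/demo/index.html');
```

The URLs this returns could then replace whatever the first pass recorded for the same tags.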
I may wind up heaving this onto CLPM and seeing what happens. I can't believe
I'm the first person to be dealing with this.
Walter