APM: html parsing / updating q.

Wed Nov 3 12:43:10 CST 2004

hi list,

I have some specific perlish problems I'd like advice on, but I'll
outline what I'm trying to achieve first. This became a long email and 
you'll probably get lost as I recreate the vast confusion that surrounds 
this whole thing in my mind.. but hopefully there's enough here to 
solicit some useful pointers or techniques that might help?

I have a CGI that is configured as a DirectoryIndex and/or text/html
handler (for Apache 1.3). It's a CGI::Application, with the default
run mode being to simply wrap and/or transform the requested HTML file.
It does things like extracting blocks from the original HTML and
inserting them into a template, adding in somewhat dynamic navigation 
and so on. I'm also working on an edit/update run mode which would a) 
draw out form elements to edit these content blocks, and b) write the 
changed content back into the original file. The same basic wrapper will 
be reused somehow when I come to things like a 404 handler, search 
results etc.

The original html is fairly simple, with just enough formatting and 
structure to make editing and updates easy to a novice using a wysiwyg 
editor. There are markers in there that my wrapper pairs up with
placeholders in the (HTML::Template) template to knit them together.
Mostly I use id attributes, but I also support <!-- START: ContentBlock 
--> <!-- END: ContentBlock --> kind of constructs.
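
A minimal sketch of the comment-marker case, for illustration only. The
helper names extract_block and replace_block are hypothetical (they are
not the actual wrapper code), and the sketch assumes each marker name
appears at most once per file:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between
# <!-- START: $name --> and <!-- END: $name -->, or undef if not found.
sub extract_block {
    my ($html, $name) = @_;
    my ($block) = $html =~ m{
        <!--\s*START:\s*\Q$name\E\s*-->   # opening marker
        (.*?)                             # block content, non-greedy
        <!--\s*END:\s*\Q$name\E\s*-->     # closing marker
    }xs;
    return $block;
}

# Hypothetical helper: swap in new content, keeping both markers intact.
sub replace_block {
    my ($html, $name, $new) = @_;
    $html =~ s{(<!--\s*START:\s*\Q$name\E\s*-->).*?(<!--\s*END:\s*\Q$name\E\s*-->)}
              {$1$new$2}s;
    return $html;
}

my $page = '<p>intro</p>'
         . '<!-- START: ContentBlock --><p>Hello</p><!-- END: ContentBlock -->';

print extract_block($page, 'ContentBlock'), "\n";   # prints <p>Hello</p>
print replace_block($page, 'ContentBlock', '<p>Bye</p>'), "\n";
```

The regex route only suits the comment markers; blocks identified by id
attributes still need a real parser.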

My first pass at this used XHTML for my source document format, and
XSLT (with XML::LibXML / XML::LibXSLT). I hit a couple of snags:
1) Some namespace issue (I think) was preventing me from doing the
simplest transformations in XSLT. That, plus character set / encoding /
entity issues and my inexperience with XSLT, was frustrating. None of
this was insurmountable, but the second snag was:
2) Guaranteeing a well-formed XHTML file. Whether or not the author is
experienced with HTML (I'm planning on using the same system on my
personal site), the reality is that mistakes happen. Any system which
relies on well-formedness to render a page is fundamentally flawed
here.

So, I turned to HTML::TreeBuilder and HTML::Element, which are more 
forgiving by design. I have something working (e.g. see
http://www.umlaufsculpture.org/ .. where the wrapper is really trivial
and currently the whole thing could probably be achieved by a couple
of SSIs.)
Now I want to add editing functionality - and I've hit snags working
with HTML::Element. It will let me insert new nodes, but even if the
original markup was well-formed, the new output always has optional end
tags (</p>, </li> and so on) omitted. I'm not seeing anything that
addresses this behaviour in the perldocs. This is a problem - the
quality of the output is important to me (I want to use this on a
resume site where my ability to code valid, clean HTML is one of the
things I'm showcasing).
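
To make the problem concrete, here is a small sketch that reproduces it
and tries one possible way around it. The HTML::Element documentation
describes an optional third argument to as_HTML: a hashref of tag names
whose end tags may be omitted. Passing an empty hashref should mean no
end tag is treated as optional. The sample markup is made up, and this
assumes an HTML::Tree release that supports that argument:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample content with explicit </li> and </p> end tags.
my $tree = HTML::TreeBuilder->new_from_content(
    '<ul><li>one</li><li>two</li></ul><p>para</p>'
);

# Default serialization: optional end tags may be dropped.
my $default = $tree->as_HTML;

# Arguments: entities to encode (undef = default set), indent string,
# and a hashref of tags whose end tags are optional ({} = none).
my $html = $tree->as_HTML(undef, '  ', {});

print $html;

$tree->delete;   # free the tree explicitly, as the HTML::Tree docs advise
```

If that argument does what the documentation says, the second dump
should keep every </li> and </p>.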

Furthermore, I'd like to implement a DOM-like interface and add get/set 
methods for elements. But I'm not sure how I can leverage 
HTML::TreeBuilder to build its tree with my own class of objects instead
of HTML::Element objects. (simple OO Perl question?)

I'm aware there's likely a lot of wheel reinventing here. My other 
requirement that limits my choices is that mod_perl is not available to 
me. This rules out HTML::Mason, and probably lots of other better ideas.

More thoughts .. If well-formedness is the only obstacle to using LibXML 
and friends, maybe I can somehow efficiently trap or head off any 
potential parse errors before it all blows up?
There's a tidylib (http://tidy.sf.net/).. but the Perl XS interface is
still in the works, so no help there. Does anyone know how libxml2's
recover() method works?
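
For reference, the kind of thing I have in mind is roughly the sketch
below. It uses XML::LibXML's recover() parser option together with
parse_html_string(), which goes through libxml2's HTML parser rather
than the XML one. The broken sample markup is made up, and how much
recover mode actually salvages from a given file is exactly what I'm
unsure about:

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;
$parser->recover(1);    # ask the parser to carry on after errors

# Made-up example of sloppy markup: unclosed <p> elements.
my $broken = '<div><p>unclosed paragraph<p>another</div>';

# eval traps a fatal parse error so the wrapper can fall back gracefully.
my $doc = eval { $parser->parse_html_string($broken) };

if ($doc) {
    print $doc->toStringHTML;
}
else {
    warn "parse failed: $@";
}
```

Parser warnings may still be printed to STDERR in this mode.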

Also, this is, as always, as much a learning exercise as anything else for
me, so I'd like to work through some of these problems before switching 
horses entirely.

thanks for any and all thoughts,

Sam