SPUG: not_quite_XML::Parser

Bill Campbell bill at celestial.com
Sat Feb 10 21:14:18 PST 2007


On Sat, Feb 10, 2007, Joshua ben Jore wrote:
>On 2/8/07, Michael R. Wolf <MichaelRWolf at att.net> wrote:
>> I've got some almost_XML code.  That is, it is not well-formed.  Almost well
>> formed, but "almost" is "not".  It appears to be line-oriented enough that a
>> simple-minded line processing could clean it up, but I don't want to rely on
>> simple-minded if there's a TagSoup::Parser that I could use to clean it up.
>> Suggestions?
>
>XML::LibXML has an "HTML" feature which lets it handle badly formed
>input. I've even used it to scrape web sites. Works neat.

I often cheat and pipe HTML through the "tidy" program before
parsing it.  I've written some scripts that de-Microsoft HTML, stripping
metadata, font, and color markup to produce clean HTML
(they're in Python, though, not Perl :-).
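
A minimal sketch of that kind of cleanup, not the original scripts.  The
names run_tidy, strip_presentation, DROP_TAGS and DROP_ATTRS are made up
for illustration, and the tidy flags (-q, -asxhtml, --show-warnings) assume
a stock HTML Tidy on the PATH.  Comments and the doctype are dropped, and
script/style bodies are not treated specially.

```python
import shutil
import subprocess
from html import escape
from html.parser import HTMLParser

# Hypothetical lists for this sketch; adjust to whatever the input contains.
DROP_TAGS = {"font", "meta"}
DROP_ATTRS = {"color", "bgcolor", "face", "style", "class"}


def run_tidy(html):
    """Pipe HTML through tidy if it is installed, else return it unchanged."""
    if shutil.which("tidy") is None:
        return html
    proc = subprocess.run(
        ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
        input=html, capture_output=True, text=True,
    )
    # tidy exits 0 when clean, 1 on warnings, 2 on errors.
    if proc.returncode in (0, 1) and proc.stdout:
        return proc.stdout
    return html


class Cleaner(HTMLParser):
    """Re-emit the document, skipping unwanted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        kept = [(k, v) for k, v in attrs if k not in DROP_ATTRS]
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )

    def handle_starttag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))


def strip_presentation(html):
    """Remove font/meta tags and color/style attributes from HTML text."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Typical use would be strip_presentation(run_tidy(raw_html)), so the
parser only ever sees markup that tidy has already straightened out.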

Bill
--
INTERNET:   bill at Celestial.COM  Bill Campbell; Celestial Software, LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
FAX:            (206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676

"If taxation without consent is robbery, the United States government
has never had, has not now, and is never likely to have, a single honest
dollar in its treasury." -- Lysander Spooner, Letter to Grover Cleveland 1886

