bill at celestial.com
Sat Feb 10 21:14:18 PST 2007
On Sat, Feb 10, 2007, Joshua ben Jore wrote:
>On 2/8/07, Michael R. Wolf <MichaelRWolf at att.net> wrote:
>> I've got some almost_XML code. That is, it is not well-formed. Almost well
>> formed, but "almost" is "not". It appears to be line-oriented enough that a
>> simple-minded line processing could clean it up, but I don't want to rely on
>> simple-minded if there's a TagSoup::Parser that I could use to clean it up.
>XML::LibXML has an "HTML" feature which lets it handle badly formed
>input. I've even used it to scrape web sites. Works neat.
I often cheat and pipe HTML through the ``tidy'' program before
parsing it. I've also written some scripts that de-Microsoft HTML,
stripping the metadata, font, and color stuff to produce clean HTML
(they're in Python though, not Perl :-).
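
For anyone who wants a starting point, here is a rough sketch of both
ideas in Python. It is not the actual scripts mentioned above, just an
illustration under some assumptions: the names run_tidy, clean_html,
DROP_TAGS, and DROP_ATTRS are made up for this example, and the lists of
tags and attributes to drop are guesses at what "metadata, font, and
color stuff" might cover. run_tidy assumes a ``tidy'' binary is on the
PATH and passes the input through unchanged if it is not.

```python
import shutil
import subprocess
from html import escape
from html.parser import HTMLParser

# Assumed for illustration: which tags and attributes count as clutter.
DROP_TAGS = {"font", "meta"}
DROP_ATTRS = {"style", "class", "color", "bgcolor", "face", "lang"}


def run_tidy(html):
    """Pipe HTML through the external tidy program, if it is installed."""
    if shutil.which("tidy") is None:
        return html  # no tidy available; hand back the input untouched
    result = subprocess.run(
        ["tidy", "-q", "-asxml"],  # quiet, emit well-formed XHTML
        input=html, capture_output=True, text=True,
    )
    # tidy exits 0 when clean, 1 on warnings, 2 on errors it could not fix.
    return result.stdout if result.returncode < 2 else html


class Cleaner(HTMLParser):
    """Re-emit the markup, skipping unwanted tags, attributes, and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, close=""):
        if tag in DROP_TAGS:
            return
        kept = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )
        self.out.append(f"<{tag}{kept}{close}>")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # keep entities such as &amp; as written

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        pass  # drop comments, including Office conditional comments

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")  # keep the DOCTYPE


def clean_html(html):
    """Return html with the DROP_TAGS and DROP_ATTRS clutter removed."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Typical use would be something like clean_html(run_tidy(raw)), so that
the stripping step sees markup tidy has already straightened out. Note
that only the font tags themselves are dropped; the text inside them is
kept.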
INTERNET: bill at Celestial.COM Bill Campbell; Celestial Software, LLC
URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way
FAX: (206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676
"If taxation without consent is robbery, the United States government
has never had, has not now, and is never likely to have, a single honest
dollar in its treasury." -- Lysander Spooner, Letter to Grover Cleveland 1886