[tpm] Solutions and kibitzers

Liam R E Quin liam at holoweb.net
Mon Nov 16 14:43:58 PST 2015


On Tue, 22 Oct 2013 12:45:20 -0400
arocker at Vex.Net wrote:

> It seemed to be a simple problem, parsing some sort of *ML stream, and
> wc's output on the script was  25  88 526. (6 of those 25 lines do the
> actual work.)
> 
> To my surprise, I've received all sorts of abuse for not using an XML
> parser module. (To which the poster may or may not have had easy access.)

If they had Perl they had an XML parser.

The problem with handling XML as text is that people often don't account for what seem like corner cases.

Some examples:

1. these are all the same in XML:
  <boy socks='black'></boy>
  <boy
      socks="black"
  />

  <boy  socks = "black"></boy
  >

  <boy socks="bl&#x61;ck"  />

  The following variant may or may not be the same, but is still legal:

  <boy socks='black'><!-- . . . . --></boy>

  Did you account for all of them?

2. text entities:
   <!DOCTYPE boy [
     <!ENTITY socks "black">
   ]>

   <boy socks="&socks;"/>

3. UTF-8 is common on Unix systems but other encodings are legal, and are signaled with
   an XML encoding declaration; did you handle them?
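
A rough sketch of what a parser buys you, using Python's standard xml.etree.ElementTree only because it needs no extra modules; the same idea should carry over to whichever Perl XML module you have. The names SAME, WITH_COMMENT and summarize are made up for illustration, the entity example is closed as an empty element so it parses on its own, and the "café" value in the encoding check is my own assumption, since "black" is plain ASCII and would not show the difference.

```python
import xml.etree.ElementTree as ET

# Spellings from example 1, then the entity form from example 2
# (closed as an empty element here so that it parses on its own).
SAME = [
    "<boy socks='black'></boy>",
    '<boy\n      socks="black"\n  />',
    '<boy  socks = "black"></boy\n  >',
    '<boy socks="bl&#x61;ck"  />',
    '<!DOCTYPE boy [\n  <!ENTITY socks "black">\n]>\n<boy socks="&socks;"/>',
]

# The variant that "may or may not be the same".
WITH_COMMENT = "<boy socks='black'><!-- . . . . --></boy>"


def summarize(doc):
    """Parse doc (str or bytes) and return what an application would see:
    tag name, sorted attributes, number of child elements, leading text."""
    root = ET.fromstring(doc)
    return (root.tag, tuple(sorted(root.attrib.items())), len(root), root.text)


# Every spelling collapses to a single summary once parsed.
seen = {summarize(doc) for doc in SAME}
print(seen)

# ElementTree's default parser drops comments, so this variant looks
# identical here; a program that cares about comments would disagree.
print(summarize(WITH_COMMENT) in seen)

# Example 3: the same document in two encodings (hypothetical value).
# The bytes differ, but the parser honours the encoding declaration.
TEMPLATE = '<?xml version="1.0" encoding="{}"?><boy socks="caf\u00e9"/>'
latin1 = TEMPLATE.format("ISO-8859-1").encode("iso-8859-1")
utf8 = TEMPLATE.format("UTF-8").encode("utf-8")
print(latin1 != utf8, summarize(latin1) == summarize(utf8))
```

A regex written against the first spelling would have to be taught about each of the others separately; the parser handles them all for free.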

A five-minute hack that isn't for production is one thing; a program for production is another.

Many (not all) things you might use Perl for with XML are better done with XSLT and/or XQuery.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/

