[Phoenix-pm] Meeting yesterday (June 30) wrapup

Fri Jul 1 12:55:33 PDT 2005

Hi everyone,

Thanks for coming, and thanks for listening to me talk. As usual, I talked
for longer than I meant to. Whoops. Nello's was unusually busy. We'll
probably do Nello's again, but not for a while, and not for code 
presentations... regardless, I think having food at the meetings made
things a lot easier (we're all busy people) so I think I'll see about
ordering out for pizza for our regular meetings, whereever they wind up. 

Good to meet all of the new people. I'm sorry I didn't a chance to chat with 
you guys more and I hope you'll be back. It's hard to get to know people
in two hours with so much chaos. 

I completely forgot to give the door prize, CGI Programming with Perl, to
Brock to give away. D'oh! Next meeting, we'll just have to have two door prizes. 
Sorry to everyone who only came because of the door prize. Next time, remind 
me, or Brock, or someone. 

Er, ehm, without further ado, here's yafro.pl.

Again, you shouldn't use http.pm or TransientBaby -- they're for educational
purposes only. If you actually do any Web scraping, use HTML::Parser,
HTML::TableExtractor, or something sane. Which means you'll have
to modify this to use another HTML parser. That shouldn't be hard to
do if you use an event based one. 

I was expecting people to chime in and comment on how *they* scraped 
Web content but instead Michael just gave a lot of examples of how he 
*blocks* robots. Heh, heh, heh. 

For the benefit of people not at the meeting, here are a few comments on
the code: would have been easier to just extract all images with URLs
matching a certain pattern, and the *get_page = http::generate_get_page;
this is odd and would have been better done with the Exporter (module).

Okay. Talk to ya'll later.
-scott

On  0, Brock <awwaiid at thelackthereof.org> wrote:
> 
> We had a lovely meeting yesterday, with 10 people enjoying dinner and
> perl-talk at Nello's Pizza. In addition to random perl-related
> conversation, Mike Friedman spoke of HighWire Press [1] and their use of
> Perl, and Scott Walters told us about how he does web-scraping (and he
> will post code soon we hope :) ).
> 
> We'll have the website up soon enough. In the meantime keep an eye out
> here for the next meeting topic/time/location, which I will have picked
> out within the next two weeks. Please send topic requests and
> volunteership here to the list. I like the idea of doing two talks like
> we did this time so that we can cover a potentially wider range of
> interest and experience.
> 
> Have a good (long) weekend!
> --Brock
> 
>   [1] http://highwire.stanford.edu/
> 
> _______________________________________________
> Phoenix-pm mailing list
> Phoenix-pm at pm.org
> http://mail.pm.org/mailman/listinfo/phoenix-pm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: yafro.pl
Type: application/x-perl
Size: 10120 bytes
Desc: not available
Url : http://mail.pm.org/pipermail/phoenix-pm/attachments/20050701/933934af/yafro-0001.bin
-------------- next part --------------

0. About

  a. This is how Scott scrapes Web content
  b. This is not necessarily how you should scrape Web content
  c. Minimal examples are for illustration of the grammars and protocols 

1. Fetching

  a. LWP
  b. LWP::Simple
  c. http.pm 
     **** 1st example ****
     I.   Handles POST + GET in IIS 
     II.  Note the socket read timeouts as an alternative to alarm
     III. Browser sessions are closures with cookies and the referer bound in
     IV.  Referer tag is automagically correct -- great for stubborn sites
     V.   Knows how to deal with ASP/IIS applications that expect GET and POST
          data to be separate
     VI.  HTTP consists of a connection, sending the handshake, sending
          headers, waiting for a reply, then closing the connection

2. Parsing

  a. Every novice wants to use regex -- regexen are:
     I.   Fragile
     II.  Non-reenterant, therefore non-recursive
     III. Not stateful, therefore not a grammar -- HTML requires a grammar
     IV.  Doomed to failure
  b. HTML::LinkExtor, HTML::LinkExtactor for easy spiders -- don't forget 
     robots.txt
  c. HTML::TableExtractor, HTML::TableContentParser
  d. HTML::TreeBuilder
     I.   Insert nodes

simple recurse through tree

     II.  Remove nodes
     III. Turn everything under a node back into HTML
     IV.  Easy ton navigate the structure of the document
     V.   Easy to modify document
     VI.  Hard to extract repeated structures in the document
  e. HTML::Parser
     I.   Easy to extract repeated non-table structures
     II.  Event driven -- HTML tags generate callback events
  f. my minimal HTML parser -- like HTML::Parser -- 
     ***** 3rd example ******
     I.   This simple, minimal parser is incomplete
     II.  The various states illustrate statefulness of the HTML grammar
          . Between tags
          . Inside of tag
          . Inside of quoted values inside of tags
  g. yafro.pl -- an exampling of using an event driven HTML parser
     I.   Extracts data in a hidden field from the login form
     II.  Logs in, keeping cookies, returning hidden field data 
     III. parse_index() parses photo index pages
          . performs a callback operation for each image link found
          . identifies link to previous month's index
          . identifies link to next "with-in month" index page
          . returns link to next "with-in month" page if it exists, otherwise 
            falls back on previous month's index page
     IV.  parse_index() uses coroutines to repeatedly fall back on the HTML 
          parser without having to return -- coroutines do for 'return' what
          funcation calls did for 'goto' -- this lets us keep state implicitly
     V.   parse_index()'s caller starts off with a known URL for the latest 
          index page and loops as long parse_index() returns it another index 
          page
     VI.   Each iteration, all discovered images are downloaded and the list 
          of queued images is zeroed
     VII. Cheats and computes the full image name/location from the thumbnail 
          name.
     IIX. yafro wasn't a hard site to crack, but a lot of this
          infrastructure was created for sites that are
  h. table tweakers... no, not today

3. Normalizing

  a. Most dynamic data comes from a database
  b. A lot of dynamic data comes from a relational, structured database
  c. Why not reconstruct the original database?
     I.   Web output is the result of a join'ing several tables together
     II.  To normalize it again, identify the keys, supporting values,
          and relations
     III. By virtue of being on the same row of output, values relate
          to each other

4. Other scrapers
   a. budweiser.com
   b. Thomas Register (geekpac)
   c. Google (geekdate)
   d. I forget...

   1 ----- Fetching -------
   2 
   3 1.a. LWP
   4 
   5          # Create a user agent object
   6          use LWP::UserAgent;
   7          $ua = LWP::UserAgent->new;
   8          $ua->agent("MyApp/0.1 ");
   9 
  10          # Create a request
  11          my $req = HTTP::Request->new(POST =>
     'http://search.cpan.org/search');
  12          $req->content_type('application/x-www-form-urlencoded');
  13          $req->content('query=libwww-perl&mode=dist');
  14 
  15          # Pass request to the user agent and get a response back
  16          my $res = $ua->request($req);
  17 
  18          # Check the outcome of the response
  19          if ($res->is_success) {
  20              print $res->content;
  21          }
  22 
  23 1.b. LWP::Simple
  24 
  25         use LWP::Simple;
  26         $content = get("http://www.sn.no/");
  27         die "Couldn't get it!" unless defined $content;
  28 
  29 1.c. http.pm
  30 
  31   use http;
  32   *get_page = http::generate_get_page;
  33   ($status, $headers, $html) = get_page(
  34       $next_page,
  35       state    => $hidden_state_value,
  36       login    => 'Login',
  37       %login_info,
  38   );
  39 
  40 
  41 ----- yafro.pl ------
  42 
  43 (Excerpts)
  44 
  45 #!/usr/bin/perl
  46 
  47 #
  48 # Downloads all pictures for a given user from yafro.com. 
  49 #
  50 # Usage: perl yafro.pl --user <username>
  51 #
  52 
  53 use strict;
  54 use warnings;
  55 
  56 use Socket;
  57 
  58 use IO::Handle;
  59 use IO::Socket::INET;
  60 
  61 use Coro;
  62 use Coro::Cont;
  63 
  64 use TransientBaby::Forms;
  65 use TransientBaby;
  66 
  67 use http;
  68 
  69 my $debug = 1;
  70 
  71 my %login_info = (
  72     email    => 'scott at slowass.net',
  73     passwd   => 'XX',
  74 );
  75 
  76 # ... (stuff here)
  77 
  78 $http::ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.8)
     Gecko/20050511 Firefox';
  79 
  80 *get_page = http::generate_get_page;
  81 
  82 sub parse_index {
  83 
  84     my $page = shift;
  85     my $imagehit = shift; ref $imagehit eq 'CODE' or die;
  86 
  87     my $prev_month;
  88     my $next_month;
  89     my $next_within_month;
  90 
  91     TransientBaby::Forms::parse_html($page, csub {
  92 
  93         # This crawls index pages, picks out photos, and calls-back links
     to the 
  94         # photos/photo pages (haven't decided) back. It yields links to
     the next
  95         # index page. This is the next page within the month or else the
     previous
  96         # month, when those are exhausted.
  97 
  98         no warnings 'uninitialized';
  99 
 100         my $accessor; 
 101         my %ent;
 102 
 103         # Previous month (one of):
 104 
 105         # <tr class="row0"><td class="col0"><table style="border: solid
     0px black" width=100% cellspacing=0 cellpadding=0><tr
     class="headerRow"><td class="col0" align=left><font size=2>Show x's
     Photos in:</font> &nbsp <font size=2><b><nobr><a
     href='http://x.yafro.com/photos/2005/5/0'><<</a> June 2005 <font
     size=2 color=#999999>>></font></nobr></b></font></td></tr>
 106 
 107         # <tr class="row0"><td class="col0"><table style="border: solid
     0px black" width=100% cellspacing=0 cellpadding=0><tr
     class="headerRow"><td class="col0" align=left><font size=2>Show x's
     Photos in:</font> &nbsp <font size=2><b><nobr><a
     href='http://x.yafro.com/photos/2005/5/0'><<</a> June 2005 <font
     size=2 color=#999999>>></font></nobr></b></font></td></tr>
 108 
 109         do {
 110             yield; ($accessor, %ent) = @_;
 111         } until $ent{tag} eq 'lit' and $ent{text} =~ m/Photos in/;
 112 
 113         $debug and print "1. 'Photos in'\n";
 114 
 115         for(;;) {
 116             yield; ($accessor, %ent) = @_;
 117             if($ent{tag} eq 'a') {
 118                 $prev_month = $ent{href};
 119                 last;
 120             } elsif($ent{tag} eq 'font' and $ent{color} eq '#999999') {
 121                 # <font size=2 color=#999999> separates prev and next
     links as far as we're concerned
 122                 $prev_month = undef;
 123                 last;
 124             }
 125         }
 126 
 127         $debug and print "2. \$prev_month: $prev_month\n";
 128 
 129         for(;;) {
 130             yield; ($accessor, %ent) = @_;
 131             if($ent{tag} eq 'a') {
 132                 $next_month = $ent{href};
 133                 last;
 134             } elsif($ent{tag} eq '/nobr') {
 135                 # the </nobr> means this section is all over
 136                 $next_month = undef;
 137                 last;
 138             }
 139         }
 140         
 141         $debug and print "3. \$next_month: $next_month\n";
 142 
 143         # Image:
 144 
 145         # <a href="http://x.yafro.com/photo/897523"
     onmouseout="hideddrivetip()" onmouseover="ddrivetip('........')">
 146         # <img src='http://photos.yafro.com/pics3/i/20050627/07/3/4
     a/34ab316226ff678b4a49a26bd50_thumb.jpg' border=0>
 147 
 148         # or:
 149 
 150         # <a href="http://x.yafro.com/photo/896918"><img
     src='http://photos.yafro.com/pics3/i/20050628/22/f/b/0/fb04c3dab88ed
     dff08821f95b9d400_thumb.jpg' border=0></a><br><nobr><font
     size=1>June 28</font><font size=1>, </font><font size=1>10:09
     PM</font></nobr><br><center><a href='http://x.yafro.com/photo/896918
     >2 comments</a>
 151 
 152         # "Photos in" seems to appear only before the prev/next month
     selected... not sure in which case this worked
 153         # do {
 154         #     yield; ($accessor, %ent) = @_;
 155         # } until $ent{tag} eq 'lit' and $ent{text} =~ m/Photos posted
     in/;
 156 
 157         $debug and print "4. 'Photos posted in'\n";
 158 
 159         for(;;) {
 160             yield; ($accessor, %ent) = @_;
 161             last if $ent{tag} eq 'lit' and $ent{text} =~ m/Pages for/;
 162             # $debug and print "debug: looking for image hits: tag is
     $ent{tag}\n";
 163             # if($ent{tag} eq 'a' and $ent{onmouseout} and
     $ent{onmouseover}) # images without labels don't have these!
 164             if($ent{tag} eq 'a') {
 165                 yield; ($accessor, %ent) = @_;
 166                 last if $ent{tag} eq 'lit' and $ent{text} =~ m/Pages
     for/;
 167                 if($ent{tag} eq 'img') {
 168                     $debug and print "5. Image hit: $ent{src}\n";
 169                     $imagehit->($ent{src});
 170                 } elsif($ent{tag} eq 'lit') {
 171                     next; # that's okay -- a href surrounds the image and
     an identical one surrounds any text after 
 172                           # the image.
 173                 } else {
 174                     print "debug: in middle of thumbnails, just after a
     href, expected img src, instead got $ent{tag}!\n";
 175                 }
 176             }
 177         }
 178 
 179         $debug and print "6. 'Pages for'\n";
 180 
 181         # Next index page within current month:
 182 
 183         # <tr class="row0"><td class="col0">Pages for June: <font size=1
     color=#999999><b><</b></font> <font size=3
     color=#990033><b>1</b></font> <a href='http://x.yafro.com/photos/200
     /6/1'><font size=2><b>2</b></font></a> <a
     href='http://x.yafro.com/photos/2005/6/2'><font
     size=2><b>3</b></font></a> <a href='http://x.yafro.com/photos/2005/6
     3'><font size=2><b>4</b></font></a>
 184 
 185         for(;;) {
 186             yield; ($accessor, %ent) = @_;
 187             last if $ent{tag} eq '/tr';
 188             # this sequence marks the current page: <font size=3
     color=#990033> <b>1</b></font>
 189             if($ent{tag} eq 'font' and $ent{color} eq '#990033') {
 190                 do {
 191                     yield; ($accessor, %ent) = @_;
 192                     last if $ent{tag} eq '/tr';
 193                     if($ent{tag} eq 'a') {
 194                         $debug and print "7. Next within-month page:
     $ent{href}\n";
 195                         $next_within_month = $ent{href};
 196                         last;
 197                     }
 198                 } until 0;
 199             }
 200         }
 201 
 202     });
 203 
 204     return $next_within_month if $next_within_month;
 205     return $prev_month if $prev_month;
 206     return undef; # no more indices to crawl
 207 
 208 }
 209 
 210 # ...
 211 
 212 my $status;
 213 my $headers; 
 214 my $html;
 215 
 216 my $next_page;
 217 
 218 # ...
 219 
 220 # log-in as self so I can see the pr0n
 221 
 222 ($status, $headers, $html) = get_page(
 223     'http://www.yafro.com/',
 224 );
 225 
 226 my $hidden_state_value;
 227 
 228 TransientBaby::Forms::parse_html($html, sub {
 229     my $accessor = shift;
 230     my %ent = @_;
 231     $next_page = $ent{action} if $ent{tag} eq 'form' and exists
     $ent{name} and $ent{name} eq 'loginbox';
 232     $hidden_state_value = $ent{value} if $ent{tag} eq 'input' and exists
     $ent{name} and $ent{name} eq 'state';
 233 });
 234 
 235 # ...
 236 
 237 ($status, $headers, $html) = get_page(
 238     $next_page,
 239     state    => $hidden_state_value,
 240     login    => 'Login',
 241     %login_info,
 242 );
 243 
 244 $html =~ m/My Photos posted in/ or die "Login apparently failed - string
     'My Photos posted in' not found in results";
 245 
 246 # ...
 247 
 248 $next_page = "http://$user.yafro.com";
 249 
 250 do {
 251     ($status, $headers, $html) = get_page($next_page);
 252     die unless $status == 200;
 253     my @imagehits;
 254     $next_page = parse_index(
 255         $html, sub { push @imagehits, shift(); },
 256     );
 257     printf "\n\nimage hits: %d\nnext page: %s\n\n", scalar @imagehits,
     $next_page || '<no next page>';
 258     foreach my $image (@imagehits) {
 259         my $local_image;
 260         $image =~ s/_thumb\.jpg$/_full.jpg/;
 261         $image =~ m{.*/(.*)} ? $local_image = "jpg/$user/$1" : die;
 262         -f $local_image and do {
 263             $debug and print "$local_image already exists -- not
     re-downloading\n";
 264             next;
 265         };
 266         (my $status, my $headers, my $jpg) = get_page($image);
 267         open my $f, '>', $local_image or die "$local_image: $!";
 268         $f->print($jpg);
 269         $f->close;
 270     }
 271 } while $next_page;
 272 
 273 
 274 --- http.pm ------
 275 
 276 (Exceprts)
 277 
 278 package http;
 279 
 280 use strict;
 281 use warnings;
 282 
 283 use Socket;
 284 use IO::Handle;
 285 use IO::Socket::INET;
 286 use POSIX;
 287 
 288 our $ua;
 289 our $debug = 1;
 290 
 291 sub generate_get_page {
 292 
 293     my %cookies;
 294     my $referer;
 295     $ua or die "set \$http::ua or something";
 296 
 297     return sub {
 298 
 299         (my $page, my @args) = @_;
 300 
 301         my @headers; 
 302         push @headers, "Referer: $referer" if $referer; 
 303 
 304         my $cookies = '';
 305         foreach my $k (keys %cookies) { $cookies .= '; ' if $cookies;
     $cookies .= "$k=$cookies{$k}"; }
 306 print "Cookies! $cookies\n";
 307 
 308         (my $status, my $headers, my $html) = get_page($page, $cookies,
     $ua, \@headers, \@args); 
 309 
 310         %cookies = (%cookies, map { (split /=/, $_) } map { (split /; +/,
     $_)[0] } map { $_->[1] } grep { $_->[0] eq 'Set-Cookie' }
     @$headers);
 311         # delete $cookies{$_} for qw/expires path domain/;
 312 
 313         $referer = $page if grep { lc($_->[0]) eq 'content-type' and
     $_->[1] =~ m<text/html>i } @$headers;
 314         $referer = $page unless grep { lc($_->[0]) eq 'content-type' }
     @$headers; # No Content-Type? Must be text/html.
 315         $debug > 0 and printf "debug: referer is %s\n", $referer || '<not
     set>';
 316 
 317         return ($status, $headers, $html);
 318 
 319     };
 320 }
 321 
 322 sub get_page {
 323 
 324     my $url = shift;
 325     my $cookies = shift;
 326     my $ua = shift;
 327     my @headers = @{ shift() };
 328     my @form = @{ shift() };
 329 
 330     # XXX kind of inconsistant that @headers here is a list of the format
     "Foo: Bar" where it's returned as ['Foo', 'Bar']
 331 
 332     my $postdata = '';
 333     my $port = 80;
 334     my $output = '';
 335 
 336     $debug > 0 and print "\n\ndebug: get_page: $url\n\n";
 337 
 338     while(@form) {
 339         my $key = shift @form;
 340         my $value = escape(shift @form);
 341         $postdata .= '&' if $postdata;
 342         $postdata .= $key . '=' . $value;
 343     }
 344 
 345     eval {
 346 
 347         (my $host, my $page) = parse_url($url);
 348         my $hostaddr = inet_aton($host) or die "couldn't look up host:
     $host - giving up on url $url\n";
 349 
 350         # my $sh = IO::Socket::INET->new(
 351         #     PeerAddr  => $host,
 352         #     PeerPort  => $port,
 353         #     Proto     => 'tcp',
 354         # ) or die $!;
 355 
 356         socket(my $sh, PF_INET, SOCK_STREAM, getprotobyname('tcp')) or
     die $!;
 357     
 358         setsockopt($sh, SOL_SOCKET, SO_SNDTIMEO, pack 'LL', 15, 0 ) or
     die $!;
 359         setsockopt($sh, SOL_SOCKET, SO_RCVTIMEO, pack 'LL', 15, 0 ) or
     die $!;
 360       
 361         connect($sh, sockaddr_in($port, $hostaddr)) or
 362             die "connect failed: $! host: $host page: $page url: $url\n";
 363 
 364         push @headers, "Cookie: $cookies" if $cookies;
 365         push @headers, "Content-Length: " . length $postdata if length
     $postdata;
 366         push @headers, "Content-Type: application/x-www-form-urlencoded"
     if length $postdata;
 367 
 368         my $getstr = join('', map "$_\015\012",
 369             (length $postdata ? "POST $page HTTP/1.0" : "GET $page
     HTTP/1.0"),
 370             "User-Agent: $ua",
 371             "Host: $host",
 372             # "Referer: $url",  # should be sent as an argument now
 373             "Accept: image/gif; image/jpeg; text/html; text/plain; */*",
 374             "Connection: close",
 375             "Accept-Language: en",
 376             @headers,
 377             # XXX Content-type: multipart/form-data; boundary="abcdefg" #
     used for file uploads
 378         );
 379         $getstr .= "\015\012";
 380         $getstr .= $postdata if length $postdata;
 381         $debug > 0 and print "debug: getstr:\n$getstr";
 382         $sh->print($getstr) or die "print failed: $! host: $host page:
     $page url: $url\n";
 383 
 384         $sh->flush();
 385 
 386         read $sh, $output, 4096, length $output while not eof $sh; 
 387 
 388         close $sh;
 389     
 390     };
 391 
 392     if($@) {
 393         $debug > 0 and print "debug: get_page failed: $@\n";
 394         return undef;
 395     } else {
 396         $debug > 0 and print "debug: get_page successful: ", length
     $output, " bytes\n";
 397         return parse_headers($output);
 398     }
 399 
 400 }
 401 
 402 1;
 403 
 404 ---- parse_html() ----
 405     
 406 Please excuse my HTML parsing code.  It's sooooo 2002.  Let this be a
     lesson 
 407 to you -- don't try to parse HTML with regex!  
 408 
 409     sub parse_html {
 410       my $file = shift;
 411       my $callback = shift; $callback ||= sub { return 0; };
 412       my $callerpackage = shift;
 413     
 414       # if $callback->($accessor, %namevaluepairs) returns true, we use
     that return value in
 415       # place of the text that triggered the callback, allowing the
     callback to filter the HTML.
 416     
 417       my $name;
 418       my $text;
 419       my $state;     # 0-outside of tag; 1-inside of tag; 2-expecting
     name of new section
 420       my %keyvals;
 421       my $highwater; # where in the text the last tag started
 422     
 423       my $accessor = sub { ... };
 424     
 425       eval { while(1) {
 426     
 427         if($file =~ m{\G(<!--.*?-->)}sgc) {
 428           $text .= $1;
 429           print "debug: comment\n" if($debug);
 430           my $x = $callback->($accessor, tag=>'comment', text=>$1);
     if(defined $x) {
 431             $text .= $x;
 432           } else {
 433             $text .= $1;
 434           }
 435     
 436         } elsif($file =~ m{\G<([a-z0-9]+)}isgc) {
 437           # start of tag
 438           print "debug: tag-start\n" if($debug);
 439           $highwater = length($text);
 440           %keyvals = (tag => lc($1));
 441           $state=1;
 442           if(lc($1) eq 'div') {
 443             $state=2;
 444           } 
 445           $text .= "<" . cc($1);
 446     
 447         } elsif($file =~ m{\G<(/[a-z0-9]*)>}isgc) {
 448           # end tag
 449           $keyvals{'tag'} = lc($1);
 450           my $x = $callback->($accessor, %keyvals); if(defined $x) {
 451             $text .= $x;
 452           } else {
 453             $text .= "<".cc($1).">";
 454           }
 455           %keyvals=();
 456           print "debug: end-tag\n" if($debug);
 457     
 458         } elsif($file =~ m{\G(\s+)}sgc) {
 459           # whitespace, in or outside of tags
 460           if($state == 0) {
 461             my $x = $callback->($accessor, tag=>'lit', text=>$1);
     if(defined $x) {
 462               $text .= $x;
 463             } else {
 464               $text .= $1;
 465             }
 466           } else {
 467             $text .= $1;
 468           }
 469           print "debug: whitespace\n" if($debug);
 470     
 471         } elsif(($state == 1 || $state == 2) and
 472                 ($file =~ m{\G([a-z0-9_-]+)\s*=\s*(['"])(.*?)\2}isgc or
 473                  $file =~ m{\G([a-z0-9_-]+)\s*=\s*()([^ >]*)}isgc)) {
 474           # name=value pair, where value may or may not be quoted
 475           $keyvals{lc($1)} = $3;
 476           $text .= cc($1) . qq{="$3"}; # XXX need to preserve whitespace
 477           print "debug: name-value pair\n" if($debug);
 478     
 479         } elsif(($state == 1 || $state == 2) and
 480                 ($file =~ m{\G([a-z0-9_-]+)}isgc)) {
 481           # name without a =value attached. if above doesnt match this is
     the fallthrough.
 482           $keyvals{lc($1)} = 1;
 483           $text .= cc($1); # correct case if needed
 484           print "debug: name-value pair without a value\n" if($debug);
 485     
 486         } elsif($file =~ m{\G>}sgc) {
 487           # end of tag
 488           $state=0;
 489           my $x = $callback->($accessor, %keyvals); if(defined $x) {
 490             # overwrite the output with callback's return, starting from
     the beginning of the tag
 491             # $text may have changed (or been deleted) since $highwater
     was recorded
 492             substr($text, $highwater) = $x if($highwater && length($text)
     > $highwater);
 493           } else {
 494             $text .= '>';
 495           }
 496           print "debug: tag-end\n" if($debug);
 497     
 498         } elsif($file =~ m{\G([^<]+)}sgc and $state != 1) {
 499           # between tag literal data
 500           # $text .= $1 unless($state == 2);
 501           my $x = $callback->($accessor, tag=>'lit', text=>$1);
     if(defined $x) {
 502             $text .= $x;
 503           } else {
 504             $text .= $1;
 505           }
 506           print "debug: lit data\n" if($debug);
 507     
 508         } elsif($file =~ m{\G<!([^>]+)}sgc and $state != 1) {
 509           # DTD 
 510           print "debug: dtd\n" if($debug);
 511           $highwater = length($text);
 512           $text .= '<!' . cc($1);
 513           %keyvals = (tag => lc($1));
 514           $state=1;
 515     
 516         } elsif($file =~ m{($macro)}sgc) {  # 5.004 has issues with this
 517           # an escape of whatever format we're using for escapes
 518           print "debug: template escape\n" if($debug);
 519           # XXX if this appears in a tag, no mention will be passed to
     handler,
 520           # which may rewrite the tag wtihout it
 521           $text .= $1;
 522     
 523         } else {
 524           # this should only ever happen on end-of-string, or we have a
     logic error
 525           (my $foo) = $file =~ m{\G(.*)}sgc;
 526           print "stopped at: -->$foo\n" if($debug);
 527           if($foo) {
 528             # this is an error condition
 529             $callback->($accessor, tag=>'stopped', text=>$foo) 
 530           }
 531           return $text;
 532         }
 533       } };
 534       # shouldnt reach this point
 535       print $@ if($debug && $@);
 536       return $text;
 537     }