question about perl 5.6

Robin Houston robin at kitsite.com
Mon May 13 12:39:28 CDT 2002


On Mon, May 13, 2002 at 06:00:45PM +0100, Paul Johnston wrote:
> So I think it should be like:
> 
>  s/&#([0-9]*)/pack("U",\1)/eg;
> 
> However this does not work,

In a Perl expression (i.e. when you're using the /e switch)
you have to use $1 in place of \1 to pick up the contents
of the first capture group. In Perl code, \1 is a reference to
the constant scalar 1, which when evaluated as a number comes
out as some fairly arbitrary memory address -- probably not what
you wanted!
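
A quick way to see what \1 means in ordinary Perl code is to look at it
directly. This is only an illustrative sketch; the variable name $ref is
made up for the example, and the address printed will differ from run to run.

```perl
use strict;
use warnings;

my $ref = \1;              # a reference to the constant scalar 1

print ref($ref), "\n";     # the kind of reference: SCALAR
print "$ref\n";            # stringifies to something like SCALAR(0x...)
print $ref + 0, "\n";      # used as a number: the memory address
print $$ref, "\n";         # dereferenced: the original value, 1
```

So under /e, pack("U", \1) is handed that address rather than the digits
that were matched.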

There's a pertinent section in the "perlre" man page:

       WARNING on \1 vs $1

       Some people get too used to writing things like:

           $pattern =~ s/(\W)/\\\1/g;

       This is grandfathered for the RHS of a substitute to avoid
       shocking the sed addicts, but it's a dirty habit to get
       into.  That's because in PerlThink, the righthand side of
       a s/// is a double-quoted string.  \1 in the usual double-
       quoted string means a control-A.  The customary Unix
       meaning of \1 is kludged in for s///.  However, if you get
       into the habit of doing that, you get yourself into
       trouble if you then add an /e modifier.

           s/(\d+)/ \1 + 1 /eg;        # causes warning under -w

       Or if you try to do

           s/(\d+)/\1000/;

       You can't disambiguate that by saying \{1}000, whereas you
       can fix it with ${1}000.  Basically, the operation of
       interpolation should not be confused with the operation of
       matching a backreference.  Certainly they mean two
       different things on the left side of the s///.
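
The ${1}000 point from the man page can be tried out with a short script.
This is a minimal sketch; the input string "7" and the variable names are
made up for the example.

```perl
use strict;
use warnings;

my $input = "7";

# ${1} delimits the capture variable, so the 000 that follows is
# literal text rather than part of the variable name.
(my $output = $input) =~ s/(\d+)/${1}000/;

print "$output\n";   # prints 7000
```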


Bearing in mind that
  - \d is shorthand for [0-9] in a regular expression
  - + rather than * requires at least one digit, so a bare "&#" is left alone
  - numeric character references in HTML ought to be terminated with a semicolon

Something like

  s/&#(\d+);/pack("U", $1)/eg;

ought to do the trick.
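
Wrapped up as a small script, that substitution might look like the sketch
below. The helper name decode_numeric_refs and the sample string are
hypothetical, and the character counts assume Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: replace decimal numeric character references
# such as &#163; with the character they name.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/pack("U", $1)/eg;
    return $text;
}

my $decoded = decode_numeric_refs("Price: &#163;5");

# The six-character reference has become a single character.
printf "length %d, code point at offset 7 is %d\n",
    length($decoded), ord(substr($decoded, 7, 1));
```

Only decimal references are handled here; hexadecimal ones such as &#xA3;
would need a separate pattern.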


If you're using the Perl 5.6 series, you may need to bear in mind that
the Unicode semantics are significantly different in the upcoming Perl
5.8. For example,

  print pack("U", 163);

will print a pound sign (ISO-Latin-1 character 163) under 5.8-to-be,
but under 5.6 will print a two-byte UTF-8 sequence. If you intend the
output to be UTF-8, you may want to add something like

  eval q{ binmode(STDOUT, ":utf8") } if $] > 5.006;

to the beginning of your program. (I added the eval because the
second argument to binmode will cause a "Useless use of constant"
warning at compile-time under older perls.)
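
To see the difference the :utf8 layer makes, one option is to print to
in-memory filehandles and inspect the bytes. This is a sketch that assumes
Perl 5.8 or later (where in-memory filehandles and I/O layers are available);
the variable names are made up for the example.

```perl
use strict;
use warnings;

my $pound = pack("U", 163);   # one character, code point 163

# No layer: the character goes out as a single Latin-1 byte.
open(my $plain, '>', \my $latin1_bytes) or die "open: $!";
print {$plain} $pound;
close($plain);

# With the :utf8 layer: the same character goes out as two bytes.
open(my $layered, '>:utf8', \my $utf8_bytes) or die "open: $!";
print {$layered} $pound;
close($layered);

print "no layer: ", unpack("H*", $latin1_bytes), "\n";   # a3
print ":utf8:    ", unpack("H*", $utf8_bytes), "\n";     # c2a3
```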


Does this help?

 .robin.