[tpm] trouble with unicode versus octets

Fulko Hew fulko.hew at gmail.com
Tue Oct 16 06:18:19 PDT 2012


On Mon, Oct 15, 2012 at 3:23 PM, Stuart Watt <stuart at morungos.com> wrote:

> It's not an uncommon problem, but it's a messy one. And it's basically an
> application decision.
>
> The module you need is Encode, and what you probably need is
>
> my $encoded = Encode::encode('utf8', $utf8_string);
>

Yes, I've been reading all this stuff, but it still doesn't make sense to me
(nor, I see, to many others... http://www.perlmonks.org/?node_id=906373).

All of the responses I've read so far assume you are processing textual
strings and not octet strings.

Reading 'perlunitut', I see:

  Encoding (as a verb) is the conversion from *text* to *binary*. To
  encode, you have to supply the target encoding, for example iso-8859-1
  or UTF-8. Some encodings, like the iso-8859 ("latin") range, do not
  support the full Unicode standard; characters that can't be
  represented are lost in the conversion.
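
To make the "lost in the conversion" part concrete, here is a minimal
sketch using the Encode module. The sample string is made up for
illustration; with the default (no CHECK argument) behaviour, a character
the target encoding cannot represent comes out as a '?' substitute.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: e-acute (U+00E9) and a smiley (U+263A).
my $text = "caf\x{e9} \x{263a}";

my $utf8   = encode('UTF-8', $text);       # every character is representable
my $latin1 = encode('iso-8859-1', $text);  # the smiley has no Latin-1 code

# Dump the resulting octets as hex so nothing depends on the terminal.
my $hex = sub { join ' ', map { sprintf '%02x', $_ } unpack 'C*', shift };
print "UTF-8:   ", $hex->($utf8),   "\n";  # 63 61 66 c3 a9 20 e2 98 ba
print "Latin-1: ", $hex->($latin1), "\n";  # 63 61 66 e9 20 3f
```

The trailing 3f is the '?' that replaced the smiley; the e-acute survives
in both encodings, but as different octets.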

The scary part is that I don't really know what the original format is.
(In this case it happens to be text that contains MS Windows file names that
is causing me grief.)

... another day passes since I wrote the above part ...

Trying to patch my original program sent me wandering in circles, so I
wrote a simple test program instead and discovered that, indeed,
   utf8::encode($msg);
addresses the problem [ just as described :-) ].
Once I found the appropriate spot in my real code, the same call
'compensated' for the issue there too.
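
A test along these lines might look like the sketch below. It is not the
actual test program: the message text is invented, and utf8::downgrade is
used only as a stand-in for a byte-oriented write (it dies when a string
contains code points above 0xFF), on the assumption that the socket layer
fails in a similar way.

```perl
use strict;
use warnings;

# Hypothetical message containing a wide character (U+263A).
my $msg = "r\x{e9}sum\x{e9} \x{263a}.txt";

# Returns true if the string can be treated as plain octets.
sub can_write_as_bytes {
    my ($copy) = @_;                       # work on a copy of the argument
    return eval { utf8::downgrade($copy); 1 } ? 1 : 0;
}

my $ok_before = can_write_as_bytes($msg);  # wide character still present
utf8::encode($msg);                        # in place: characters -> UTF-8 octets
my $ok_after  = can_write_as_bytes($msg);  # now only octets remain

print "before encode: ", ($ok_before ? "ok" : "dies"), "\n";
print "after encode:  ", ($ok_after  ? "ok" : "dies"), "\n";
```

Note that utf8::encode changes the string in place and returns nothing
useful, so it has to be called on the variable that is later written.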

[ But what happens if I don't feed it a text string (wide or narrow)
  but my octet string instead? What comes out the other end?

  I guess it passes it through transparently
  (knowing that it no longer contains a UTF string)
]
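
A quick way to check that guess, sketched with made-up sample data: as
far as I can tell, utf8::encode cannot know whether a string already
holds octets, so it treats every value from 0x80 to 0xFF as a Latin-1
character and encodes it again. Pure ASCII octets do pass through
unchanged.

```perl
use strict;
use warnings;

my $ascii  = "plain.txt";      # octets all below 0x80
my $octets = "caf\xC3\xA9";    # already UTF-8 encoded: 5 octets

my $ascii_out = $ascii;
utf8::encode($ascii_out);      # nothing to change for ASCII

my $octets_out = $octets;
utf8::encode($octets_out);     # 0xC3 and 0xA9 are each encoded again

printf "ascii:  %d -> %d octets\n", length($ascii),  length($ascii_out);
printf "octets: %d -> %d octets\n", length($octets), length($octets_out);
print  "double encoded: ",
       join(' ', map { sprintf '%02x', $_ } unpack 'C*', $octets_out), "\n";
```

So an octet string is only safe to pass through if it is known to be
ASCII; otherwise the encode step should be applied exactly once, at the
point where text becomes bytes.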

> which translates the string that's in UTF8 and which you can't print, into
> a set of bytes in UTF8, which you can. That stops the print error. However,
> this is for printing or writing over a network connection, and you might
> need a different encoding depending on your protocol. The Encode module can
> do most any encoding you like or need, and many that seem ridiculous.
>
> In the case of UTF8, and only because internally Perl uses UTF8, that sets
> a special flag that effectively stops Perl from giving wide character
> errors. But this is highly confusing special behaviour, and it's often
> worth testing Perl with non-UTF8 data printing/communications to flush out
> these issues.
>
> The problems are worse if you don't know what your strings are to begin
> with. It's best to help your app by making everything UTF8 (internally) as
> soon as possible, assuming it isn't already. There is no way to tell,
> reliably, whether a piece of random data really is UTF8 text as that's
> really down to how it is supposed to be interpreted.
>
> --S
>
> On 2012-10-15, at 1:59 PM, Fulko Hew wrote:
>
> I have a problem (so what else is new!) that I haven't yet found a
> solution to ...
>
> In my app, I receive strings, massage them, and 'push_write' them to an AnyEvent
> socket.
>
> Occasionally, my app receives a Unicode string...
> so when the write happens, Perl (inside the AnyEvent module)
> dies with the error:
>
>    Wide character in subroutine entry at ...
>
> What I haven't figured out yet is, how to coerce the character string into
> an octet string (for the rest of its life, ie. in subsequent modules)
> so the warning/dying goes away.
>
> TIA
> Fulko
>
>