[tpm] trouble with unicode versus octets
stuart at morungos.com
Tue Oct 16 06:54:25 PDT 2012
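Hmmm. You should really use Encode, not utf8::encode. They ought to be equivalent, apart from utf8::encode working in place and returning nothing, i.e.,
utf8::encode($bar);    # $bar itself now holds the octets
$foo = Encode::encode('utf8', $bar);
ought to give the same octets when applied to the same string. And the same is true for decode.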
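A minimal sketch of the comparison, using a made-up test string. The variable names are only for illustration. The one thing to watch is that utf8::encode converts its argument in place and returns nothing, so it is applied to a copy here.

```perl
use strict;
use warnings;
use Encode ();

# A character string: "café", with U+00E9 as a single character.
my $text = "caf\x{e9}";

# Encode::encode returns the octets and leaves its argument alone.
my $via_encode = Encode::encode('utf8', $text);

# utf8::encode converts its argument in place and returns nothing,
# so work on a copy.
my $via_utf8 = $text;
utf8::encode($via_utf8);

printf "characters: %d, octets: %d, same octets: %s\n",
    length($text), length($via_encode),
    ($via_encode eq $via_utf8 ? "yes" : "no");
```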
Now to the interesting bit -- not knowing about the original format. I had that issue, and there is no real solution to it in the general case. Windows is supposed to store file names as UTF-16LE underneath, but the API for Perl won't give you that directly.
You might try Encode::Guess as a start.
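Something along these lines is how I would expect Encode::Guess to be used. The helper name guess_and_decode is made up for this sketch, and the sample octets are just UTF-8 for "café".

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: returns (decoded text, encoding name), or an
# empty list when the guess fails or is ambiguous.
sub guess_and_decode {
    my ($octets, @suspects) = @_;
    my $enc = guess_encoding($octets, @suspects);
    return unless ref $enc;      # on failure $enc is an error string
    return ($enc->decode($octets), $enc->name);
}

my ($text, $name) = guess_and_decode("caf\xc3\xa9");
print defined $text ? "guessed $name\n" : "could not guess\n";
```

With no extra suspects the guesser only considers ASCII, UTF-8 and BOM-marked UTF-16/32. Adding 8-bit suspects such as latin1 tends to make the guess ambiguous, since those encodings accept nearly any byte sequence.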
The Encode functions do include a check parameter, which you can use to get them to tell you when the data doesn't match the expected encoding. This allows you to try a decode and, if it doesn't match, try something else. I've done this for a Windows app where I was expecting one of a small number of encodings, but you need to try them in order, or write a (byte-level) heuristic to make a guess. On files you fairly often have a byte order mark (for non-8-bit encodings), and File::BOM will help there. That doesn't help with filenames.
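A rough sketch of the try-in-order approach, using Encode::FB_CROAK as the check value so that a mismatch dies and can be caught with eval. The helper name decode_first_match and the candidate list are assumptions for illustration, not the code from my app.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: try each candidate encoding in order and return
# (text, encoding) for the first one that decodes cleanly.
sub decode_first_match {
    my ($octets, @candidates) = @_;
    for my $encoding (@candidates) {
        my $copy = $octets;   # with a check value, decode may modify its input
        my $text = eval { Encode::decode($encoding, $copy, Encode::FB_CROAK()) };
        return ($text, $encoding) if defined $text;
    }
    return;   # nothing matched
}

# Strict encodings go first; a single-byte encoding such as cp1252
# accepts almost any input, so it only makes sense as a late fallback.
my @order = ('UTF-8', 'cp1252');

my ($utf8_text, $utf8_enc) = decode_first_match("caf\xc3\xa9", @order);
my ($cp_text,   $cp_enc)   = decode_first_match("caf\xe9",     @order);
print "first: $utf8_enc, second: $cp_enc\n";
```

The order matters because the first clean decode wins. Putting cp1252 ahead of UTF-8 would turn valid UTF-8 input into mojibake without any error being raised.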
Printing does transparently pass through octets in all cases. That's why the ::encode functions stop the errors. It might look like all that happens is the utf8 is sent directly through, but that is not actually the case. Perl just treats the string as a set of octets and tries to print. If it ever finds one >7 bits (i.e., non-ASCII), it moans. Strict utf8 makes this impossible without an octet >127. Sloppy utf8 allows zeros to be used to pad, and it is technically (maybe) possible for these to be encoded to values <= 127 and print just fine. Or it might not happen. But conceptually it could. The only safe thing to do is to build the code so that incoming data is correctly encoded as soon as possible (or assumed to be).
You still need to be careful. Even on Windows + Perl, you can easily double encode utf8. If you call Encode::encode more than once on the same string, you garble it. That's why it's really down to the application to maintain integrity. In retrospect, Perl hides unicode so well that it is really hard to find where an issue lies when all goes wrong.
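To make the double-encoding problem concrete, here is a small sketch with a made-up test string. Encoding the already-encoded octets treats each octet as a character and encodes it again.

```perl
use strict;
use warnings;
use Encode ();

my $text  = "caf\x{e9}";                       # 4 characters
my $once  = Encode::encode('UTF-8', $text);    # 5 octets: correct
my $twice = Encode::encode('UTF-8', $once);    # 7 octets: garbled

# Decoding the double-encoded octets once does not give the text back;
# it gives the 5-character mojibake form instead.
my $back = Encode::decode('UTF-8', $twice);

printf "once=%d twice=%d round-trip ok: %s\n",
    length($once), length($twice), ($back eq $text ? "yes" : "no");
```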
Finally, if the problem is in the Windows file functions, things might simply be bad. I struggled with that area for a good long time. I may be able to dig out code that reduces the issue, but on my ever-expanding to-do list was a rewrite of the Perl core Windows API to finally handle filenames > 250 characters and handle unicode file names. Neither is, as far as I could tell from looking at the C code, completely correct. Certainly, my app would garble file names in these areas of Perl.
Your best bet may be some controlled testing to pin down the problems, e.g., a known file name with an accented character. If you do this, I'd be very grateful for the test cases, which might prod me into doing something about it on a cold winter night.
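As a starting point, a controlled test might look something like the sketch below. It assumes file names are handed to the OS as UTF-8 octets, which is exactly the assumption that may not hold with Perl's file functions on Windows. The file name and the hex dump format are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use File::Temp ();

# Assumption: file names go to the OS as UTF-8 octets. On Windows,
# Perl's file functions may behave differently, which is the thing
# this test is meant to expose.
my $name   = "caf\x{e9}.txt";                  # known accented name
my $octets = Encode::encode('UTF-8', $name);

my $dir = File::Temp::tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/$octets") or die "cannot create file: $!";
close($fh);

opendir(my $dh, $dir) or die "cannot read $dir: $!";
my @found = grep { !/^\.\.?$/ } readdir($dh);
closedir($dh);

# Compare octets with octets; dump what came back as hex for the report.
my $ok = (@found == 1 && $found[0] eq $octets);
printf "%s: %s\n", ($ok ? "ok" : "MISMATCH"),
    join(' ', map { unpack('H*', $_) } @found);
```

The hex dump is there so that a mismatch report shows the exact octets that readdir returned, rather than whatever the terminal makes of them.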
On 2012-10-16, at 9:18 AM, Fulko Hew wrote:
> On Mon, Oct 15, 2012 at 3:23 PM, Stuart Watt <stuart at morungos.com> wrote:
> It's not an uncommon problem, but it's a messy one. And it's basically an application decision.
> The module you need is Encode, and what you probably need is
> my $encoded = Encode::encode('utf8', $utf8_string);
> Yes, I've been reading all this stuff, but it still doesn't make sense to me
> (as I see also to many others... http://www.perlmonks.org/?node_id=906373)
> All of the responses I've read so far, assume you are processing textual strings
> and not octet strings.
> Reading the 'perlunitut', I see:
> Encoding (as a verb) is the conversion from text to binary. To encode, you have
> to supply the target encoding, for example iso-8859-1 or UTF-8. Some encodings,
> like the iso-8859 ("latin") range, do not support the full Unicode standard;
> characters that can't be represented are lost in the conversion.
> The scary part is that I don't really know what the original format is.
> (In this case it happens to be text that contains MS Windows file names that
> is causing me grief.)
> ... another day passes since I wrote the above part ...
> Trying to patch my original program made me wander in my attempts to fix the
> problem, so once I created a simple test program, I discovered that indeed,
> would address the problem [ just as described :-) ].
> Then once I found the appropriate spot in my code, it had 'compensated' for the issue.
> [ But what happens if I don't feed it a text string (wide or narrow)
> but my octet string instead? What comes out the other end?
> I guess it passes it through transparently
> (knowing that it no longer contains a UTF string)
> which translates the string that's in UTF8 and which you can't print, into a set of bytes in UTF8, which you can. That stops the print error. However, this is for printing or writing over a network connection, and you might need a different encoding depending on your protocol. The Encode module can do most any encoding you like or need, and many that seem ridiculous.
> In the case of UTF8, and only because internally Perl uses UTF8, that sets a special flag that effectively stops Perl from giving wide character errors. But this is highly confusing special behaviour, and it's often worth testing Perl with non-UTF8 data printing/communications to flush out these issues.
> The problems are worse if you don't know what your strings are to begin with. It's best to help your app by making everything UTF8 (internally) as soon as possible, assuming it isn't already. There is no way to tell, reliably, whether a piece of random data really is UTF8 text, as that's really down to how it is supposed to be interpreted.
> On 2012-10-15, at 1:59 PM, Fulko Hew wrote:
>> I have a problem (so what else is new!) that I haven't yet found a solution to ...
>> In my app, I receive strings, massage them, and 'push_write' them to an AnyEvent socket.
>> Occasionally, my app receives a unicoded string...
>> so when the write happens, Perl (inside the AnyEvent module)
>> dies with the error:
>> Wide character in subroutine entry at ...
>> What I haven't figured out yet is, how to coerce the character string into
>> an octet string (for the rest of its life, i.e., in subsequent modules)
>> so the warning/dying goes away.