[Philadelphia-pm] Unicode BOM in input files

John Karr brainbuz at brainbuz.org
Tue Oct 27 15:18:16 PDT 2020


One of my many wishes for Perl 7 is a switch to native Unicode string 
handling. Unfortunately, judging by the effort it has taken just to get 
strict and warnings enabled (work I've done a little of and Jim Keenan 
a lot of), and by how much would probably break in Perl and on CPAN, 
that is really unlikely without a deep-pocketed corporate sponsor.

I recently discovered a trick that helps with one of the problems that 
come from Perl not being Unicode-native.

If you add 'export PERL_UNICODE=AS' to your environment, many wide 
character errors will vanish. The same can be done with the -C switch 
to Perl or by adding 'binmode(STDOUT, ":utf8");' to your boilerplate.
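
For anyone who wants to try the in-script variant, here is a minimal 
sketch. It assumes a UTF-8 terminal, and the snowman is just an 
arbitrary character above 0xFF that I picked for illustration. I used 
the stricter ':encoding(UTF-8)' layer name rather than ':utf8'; either 
one silences the warning on output.

```perl
use strict;
use warnings;

# Tell Perl to encode everything written to STDOUT and STDERR as UTF-8.
# Without this, printing a character above 0xFF triggers the
# "Wide character in print" warning.
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');

# U+2603 SNOWMAN: one character to Perl, three bytes once encoded.
my $snowman = "\x{2603}";

print "Wide character: $snowman\n";
```

Running an unmodified script as 'perl -CS script.pl' should have a 
similar effect on the standard streams; per perlrun, the S in 
PERL_UNICODE=AS covers STDIN, STDOUT and STDERR, and the A covers @ARGV.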

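Unfortunately, changing the < in open to <:encoding(UTF-8) does not 
strip the BOM; it only decodes those three bytes into the single 
character U+FEFF at the start of the first line. But once the line has 
been decoded that way,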

     $line =~ s/^\N{BOM}//;  # will remove the BOM
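
Putting the pieces together, here is a rough, self-contained sketch. 
The helper name first_line_without_bom and the temporary file are my 
own inventions for illustration, and I wrote the BOM as \x{FEFF} (its 
code point) so the example does not depend on named-character lookup.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a throwaway file that starts with the UTF-8 BOM bytes EF BB BF,
# followed by the text from Eric's example.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);                       # write raw bytes, no encoding layer
print {$out} "\xEF\xBB\xBF", "Your data: 42\n";
close $out or die "Cannot close $path: $!";

# Hypothetical helper: return the first line of a UTF-8 file with any
# leading BOM removed.
sub first_line_without_bom {
    my ($file) = @_;

    # The layer decodes bytes to characters, so the three BOM bytes
    # arrive as the single character U+FEFF.
    open my $fh, '<:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    my $line = <$fh>;
    close $fh;

    return unless defined $line;     # empty file
    $line =~ s/^\x{FEFF}//;          # strip the decoded BOM, if present
    return $line;
}

my $line = first_line_without_bom($path);
print "matched\n" if $line =~ /^Your data:/;
```

The substitution only works on a decoded string. With a plain '<' the 
line still begins with the three raw bytes, which U+FEFF does not match.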

This is all the sort of headache I want Perl to allow me to magically 
and blissfully never think about.

On 10/27/20 4:21 PM, James E Keenan wrote:
> On 10/27/20 3:45 PM, Eric Roode wrote:
>> Hello fellow mongers!
>>
>>      Today I opened and read a file.  Advanced stuff, right? :-)
>>
>>     open my $fh, '<', 'file.dat';
>>     $line = <$fh>;
>>     if ($line =~ /^Your data:/) ....
>>
>>
>>      The problem is that the input file has a Unicode BOM (byte-order 
>> mark), so the first three bytes of the string are in fact 0xEF, 0xBB, 
>> and 0xBF.  So the match fails, even though if you look at the file in 
>> an editor, it looks like it begins with "Your data".  It took me a 
>> fair amount of time to figure this out.
>>
>
> Yes, this is annoying.  I have encountered the problem before, in the 
> form of a bug report for my CPAN distro Text-CSV-Hashify:
> https://rt.cpan.org/Ticket/Display.html?id=130048
>
> If you read that ticket, you will appreciate some of the complexities 
> in this issue.  Unfortunately, I haven't had time to develop a 
> solution -- magical, automagical or otherwise.
>
> Thank you very much.
> Jim Keenan