From sdn.phlpm at mailnull.com  Tue Oct 27 12:45:52 2020
From: sdn.phlpm at mailnull.com (Eric Roode)
Date: Tue, 27 Oct 2020 15:45:52 -0400
Subject: [Philadelphia-pm] Unicode BOM in input files
Message-ID: <CALVSnROTEBvPKRtOb=rgkgVVP7rkV6SVkMwR_Hd-k23FCpZJFQ@mail.gmail.com>

Hello fellow mongers!

    Today I opened and read a file.  Advanced stuff, right?  :-)

    open my $fh, '<', 'file.dat' or die "Cannot open file.dat: $!";
    my $line = <$fh>;
    if ($line =~ /^Your data:/) { ... }


    The problem is that the input file has a Unicode BOM (byte-order mark),
so the first three bytes of the string are in fact 0xEF, 0xBB, and 0xBF.
So the match fails, even though if you look at the file in an editor, it
looks like it begins with "Your data".  It took me a fair amount of time to
figure this out.

    I am shocked that I have never encountered this before.  But I'm even
more shocked that Perl doesn't automagically handle this internally.  What
the heck, Perl??  This has me re-thinking how I open all text files!  If I
am opening text files of unknown encoding, am I expected to read the BOM
(if present) and then change the PerlIO encoding via 'binmode' myself?  For
each and every input text file I open and read?  That's BS.  I've gotta be
missing some obvious step.
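
    For reference, the manual approach described above (look for a BOM, then set the PerlIO encoding with binmode) might look roughly like the sketch below. It is only an illustration under assumptions: the helper name open_text_sniffing_bom, the UTF-8 default, and the throwaway demo file are all made up for the example, and only the common UTF-8/16/32 BOMs are checked.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# BOM byte sequences, longest first so that the UTF-32LE BOM (FF FE 00 00)
# is not mistaken for the UTF-16LE BOM (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper (not a core or CPAN function): open $path for reading,
# pick the encoding from the BOM if there is one (else $default), and leave
# the handle positioned just after the BOM.
sub open_text_sniffing_bom {
    my ($path, $default) = @_;
    $default = 'UTF-8' unless defined $default;   # assumed default

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    defined(read($fh, $head, 4)) or die "Cannot read $path: $!";

    my ($encoding, $skip) = ($default, 0);
    for my $bom (@BOMS) {
        my ($bytes, $name) = @$bom;
        next unless index($head, $bytes) == 0;    # BOM must be at offset 0
        ($encoding, $skip) = ($name, length $bytes);
        last;
    }

    # Rewind to just past the BOM (or to the start), then start decoding.
    seek($fh, $skip, 0) or die "Cannot seek in $path: $!";
    binmode($fh, ":encoding($encoding)") or die "Cannot set encoding: $!";
    return $fh;
}

# Demo on a throwaway file that starts with the UTF-8 BOM bytes EF BB BF.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFYour data: 42\n";
close $tmp or die "close failed: $!";

my $fh   = open_text_sniffing_bom($path);
my $line = <$fh>;
print "matched\n" if $line =~ /^Your data:/;
```

    Opening in :raw first keeps the BOM check at the byte level, so the decision is made before any decoding layer sees the data.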

    Any wisdom from the hive mind would be appreciated.  Thanks!

-- Eric Roode

From jkeenan at pobox.com  Tue Oct 27 13:21:40 2020
From: jkeenan at pobox.com (James E Keenan)
Date: Tue, 27 Oct 2020 16:21:40 -0400
Subject: [Philadelphia-pm] Unicode BOM in input files
In-Reply-To: <CALVSnROTEBvPKRtOb=rgkgVVP7rkV6SVkMwR_Hd-k23FCpZJFQ@mail.gmail.com>
References: <CALVSnROTEBvPKRtOb=rgkgVVP7rkV6SVkMwR_Hd-k23FCpZJFQ@mail.gmail.com>
Message-ID: <c57cef27-dc8b-fcab-6147-cdd6225a1ca1@pobox.com>

On 10/27/20 3:45 PM, Eric Roode wrote:
> Hello fellow mongers!
> 
>     Today I opened and read a file.  Advanced stuff, right?  :-)
> 
>     open my $fh, '<', 'file.dat';
>     $line = <$fh>;
>     if ($line =~ /^Your data:/) ....
> 
> 
>     The problem is that the input file has a Unicode BOM (byte-order
> mark), so the first three bytes of the string are in fact 0xEF, 0xBB,
> and 0xBF.  So the match fails, even though if you look at the file in an
> editor, it looks like it begins with "Your data".  It took me a fair
> amount of time to figure this out.
> 

Yes, this is annoying.  I have encountered the problem before, in the 
form of a bug report for my CPAN distro Text-CSV-Hashify:
https://rt.cpan.org/Ticket/Display.html?id=130048

If you read that ticket, you will appreciate some of the complexities in 
this issue.  Unfortunately, I haven't had time to develop a solution -- 
magical, automagical or otherwise.

Thank you very much.
Jim Keenan

From brainbuz at brainbuz.org  Tue Oct 27 15:18:16 2020
From: brainbuz at brainbuz.org (John Karr)
Date: Tue, 27 Oct 2020 18:18:16 -0400
Subject: [Philadelphia-pm] Unicode BOM in input files
In-Reply-To: <c57cef27-dc8b-fcab-6147-cdd6225a1ca1@pobox.com>
References: <CALVSnROTEBvPKRtOb=rgkgVVP7rkV6SVkMwR_Hd-k23FCpZJFQ@mail.gmail.com>
 <c57cef27-dc8b-fcab-6147-cdd6225a1ca1@pobox.com>
Message-ID: <3452c96f-88a7-ea7d-d7e1-12e4b60e68c8@brainbuz.org>

One of my many wishes for Perl 7 is a switch to native Unicode string
handling. Unfortunately, given the effort it has taken just to get strict
and warnings enabled (which I've been doing a little of and Jim Keenan a
lot of), and how much would probably break in Perl and on CPAN, that is
really unlikely to happen without a deep-pocketed corporate sponsor.

I recently discovered a trick that helps with one of the problems that
come from Perl not being Unicode-native.

If you add 'export PERL_UNICODE=AS' to your environment, many "Wide
character" warnings will vanish. The same can be done with the -CAS
switch to perl or, for STDOUT alone, by adding 'binmode(STDOUT, ":utf8");'
to your boilerplate.
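
To illustrate what such an output layer does, here is a small sketch. It prints to an in-memory handle instead of STDOUT so the resulting bytes can be inspected; the helper name bytes_printed is made up for the example, and :encoding(UTF-8) is used as the stricter relative of :utf8.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle, optionally
# through a PerlIO layer, and return the bytes that were written.
sub bytes_printed {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
    binmode($out, $layer) if defined $layer;
    print {$out} $text;
    close $out or die "close failed: $!";
    return $buffer;
}

my $smiley = "\x{263A}";   # U+263A, a character above 0xFF
my $bytes  = bytes_printed($smiley, ':encoding(UTF-8)');

# Show the encoded bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

With the layer in place the single character goes out as the three UTF-8 bytes E2 98 BA, with no warning. Note that binmode on STDOUT covers only that one handle, whereas the S in PERL_UNICODE=AS covers STDIN, STDOUT, and STDERR, and the A applies to @ARGV.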

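Unfortunately, changing the < in open to <:encoding(UTF-8) does not
strip the BOM: the three bytes are decoded into the single character
U+FEFF, which still sits at the start of the first line. Once the handle
is decoding, though,

    $line =~ s/^\N{BOM}//;  # will remove the BOM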
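
A minimal sketch of that substitution in context, assuming the file is UTF-8 and the handle is opened with the :encoding(UTF-8) layer so that the BOM arrives as the single character U+FEFF (the character \N{BOM} names). The helper name first_line_without_bom and the throwaway demo file are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return the first line of a UTF-8 file with any
# leading BOM removed, or nothing if the file is empty.
sub first_line_without_bom {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    my $line = <$fh>;
    close $fh;
    return unless defined $line;
    $line =~ s/^\x{FEFF}//;   # U+FEFF is the decoded byte-order mark
    return $line;
}

# Demo on a throwaway file that starts with the UTF-8 BOM bytes EF BB BF.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFYour data: 42\n";
close $tmp or die "close failed: $!";

my $line = first_line_without_bom($path);
print "matched\n" if $line =~ /^Your data:/;
```

The substitution only makes sense on decoded text; on a handle opened with a plain '<', the line begins with the three raw bytes rather than U+FEFF, and the pattern will not match.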

This is all the sort of headache I want Perl to allow me to magically 
and blissfully never think about.
