[Philadelphia-pm] Unicode BOM in input files

Eric Roode sdn.phlpm at mailnull.com
Tue Oct 27 12:45:52 PDT 2020


Hello fellow mongers!

    Today I opened and read a file.  Advanced stuff, right?  :-)

open my $fh, '<', 'file.dat' or die "Cannot open file.dat: $!";
my $line = <$fh>;
if ($line =~ /^Your data:/) { ... }


    The problem is that the input file begins with a Unicode BOM (byte-order
mark, U+FEFF) encoded as UTF-8, so the first three bytes of the string are
in fact 0xEF, 0xBB, and 0xBF.  So the match fails, even though the file,
viewed in an editor, looks like it begins with "Your data".  It took me a
fair amount of time to figure this out.
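
    For illustration, here is a minimal sketch of the failure and one
possible workaround: stripping the three BOM bytes before matching.  The
helper name strip_utf8_bom and the sample string are made up for this
example, and it assumes the line was read as raw bytes with no encoding
layer on the filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM from a raw byte string.
sub strip_utf8_bom {
    my ($line) = @_;
    $line =~ s/\A\xEF\xBB\xBF//;
    return $line;
}

# Sample first line as it would arrive from a BOM-prefixed file (made up).
my $raw = "\xEF\xBB\xBFYour data: 42\n";

my $matched_before = $raw =~ /^Your data:/ ? 1 : 0;
my $clean          = strip_utf8_bom($raw);
my $matched_after  = $clean =~ /^Your data:/ ? 1 : 0;

print "before=$matched_before after=$matched_after\n";   # before=0 after=1
```

    If the file is instead opened with an ':encoding(UTF-8)' layer, the BOM
shows up as the single character U+FEFF, so the substitution would be
s/\A\x{FEFF}// rather than the three-byte form.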

    I am shocked that I have never encountered this before.  But I'm even
more shocked that Perl doesn't automagically handle this internally.  What
the heck, Perl??  This has me re-thinking how I open all text files!  If I
am opening text files of unknown encoding, am I expected to read the BOM
(if present) and then change the PerlIO encoding via 'binmode' myself?  For
each and every input text file I open and read?  That's BS.  I've gotta be
missing some obvious step.
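
    The "read the BOM and pick the encoding myself" approach could look
roughly like the sketch below.  It is only a sketch under assumptions: the
helper name decode_with_bom is hypothetical, it handles only the UTF-8 and
UTF-16 BOMs, it falls back to UTF-8 when no BOM is found, and it works on a
byte string that has already been read (for example from a handle opened
with '<:raw') instead of calling binmode on a live filehandle.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: sniff a leading BOM and decode the remaining bytes.
# Falls back to $default (assumed UTF-8) when no known BOM is present.
sub decode_with_bom {
    my ($bytes, $default) = @_;
    $default //= 'UTF-8';

    my @boms = (
        [ "\xEF\xBB\xBF", 'UTF-8'    ],
        [ "\xFF\xFE",     'UTF-16LE' ],
        [ "\xFE\xFF",     'UTF-16BE' ],
    );

    for my $bom (@boms) {
        my ($mark, $enc) = @$bom;
        if (index($bytes, $mark) == 0) {
            # Drop the BOM bytes, then decode the rest.
            return decode($enc, substr($bytes, length $mark));
        }
    }
    return decode($default, $bytes);
}

# "Your" encoded as UTF-16LE with a BOM in front (made-up sample data).
my $text = decode_with_bom("\xFF\xFEY\x00o\x00u\x00r\x00");
print "$text\n";   # prints "Your"
```

    The UTF-32 BOMs are left out on purpose, since the UTF-32LE mark starts
with the same two bytes as the UTF-16LE mark and would need to be checked
first.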

    Any wisdom from the hive mind would be appreciated.  Thanks!

-- Eric Roode