[Pdx-pm] unicode vs western text encoding
David E. Wheeler
david at kineticode.com
Thu Jul 3 17:56:15 PDT 2008
On Jul 3, 2008, at 13:30, Thomas Keller wrote:
> We get some emails that we need to parse. They come from a web form,
> so I don't know why we are receiving some with unicode characters
> and others as simple western text encoding. We receive the submitted
> forms as a structured email message which I've written a parser to
> process. I'm having trouble when they contain unicode characters.
> Does anyone have a suggestion for first, detecting unicode in a text
> file, and second stripping it of the weird stuff? I know I can just
> use the translate function. Is that the "best" way? I'd have to know
> ahead of time all the characters that I want to allow, that seems
> really anti-best practices.
If the Content-Type header includes an encoding (the charset
parameter), decode the message body using Encode::decode(). If it
doesn't, just strip out all non-Unicode characters. I'd first run
Encode::ZapCP1252::fix_cp1252() on it, just to be safe, and then strip
out the non-Unicode characters (can't remember the code for that
off-hand…).
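
Here is a rough sketch of that approach, not a definitive implementation. The helper name clean_body() and its two arguments are made up for illustration. It also assumes that "non-Unicode characters" means bytes that are not valid UTF-8, which is only one reading of the advice. Encode::ZapCP1252 is a CPAN module, so the sketch only calls it if it is installed.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn the raw body bytes of a message into a
# Perl character string, given the value of its Content-Type header.
sub clean_body {
    my ($bytes, $content_type) = @_;
    $content_type = '' unless defined $content_type;

    # Look for a charset parameter, e.g. text/plain; charset="ISO-8859-1"
    my ($charset) = $content_type =~ /charset\s*=\s*"?([^\s";]+)/i;

    # A declared encoding that Encode recognizes: just decode with it.
    if (defined $charset && Encode::find_encoding($charset)) {
        return Encode::decode($charset, $bytes);
    }

    # No usable charset. If Encode::ZapCP1252 is installed, let it
    # repair stray CP1252 bytes first (in void context it modifies
    # its argument in place).
    if (eval { require Encode::ZapCP1252; 1 }) {
        Encode::ZapCP1252::fix_cp1252($bytes);
    }

    # Assumption: treat what is left as UTF-8. With no CHECK argument,
    # decode() replaces malformed bytes with U+FFFD, which is then
    # removed. This also drops any U+FFFD that was already in the text.
    my $text = Encode::decode('UTF-8', $bytes);
    $text =~ s/\x{FFFD}//g;
    return $text;
}
```

Called as clean_body($raw_body, $header_value), this returns a character string either way, so the rest of the parser does not need to care which path was taken.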