[Pdx-pm] unicode vs western text encoding

David E. Wheeler david at kineticode.com
Thu Jul 3 17:56:15 PDT 2008


On Jul 3, 2008, at 13:30, Thomas Keller wrote:

> We get some emails that we need to parse. They come from a web form,  
> so I don't know why we are receiving some with unicode characters  
> and others as simple western text encoding. We receive the submitted  
> forms as a structured email message which I've written a parser to  
> process. I'm having trouble when they contain unicode characters.  
> Does anyone have a suggestion for first, detecting unicode in a text  
> file, and second stripping it of the weird stuff? I know I can just  
> use the translate function. Is that the "best" way? I'd have to know  
> ahead of time all the characters that I want to allow, that seems  
> really anti-best practices.

If the Content-Type header includes a charset, decode the body using
Encode::decode(). If it doesn't, just strip out all non-Unicode
characters. I'd first run Encode::ZapCP1252::fix_cp1252() on it, just
to be safe, and then strip out the non-Unicode characters (can't remember
the code for that off-hand…).
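
For illustration only, here is a rough sketch of that flow using just the
core Encode module. It is one possible reading, not the code alluded to
above: it assumes "non-Unicode" means bytes that are not valid UTF-8, and
the helper name body_to_text and its arguments are made up for the example.
The fix_cp1252() step is shown only as a comment because Encode::ZapCP1252
is a CPAN module rather than part of core Perl.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: $bytes is the raw message body, $charset is the
# charset parameter from the Content-Type header (undef if absent).
sub body_to_text {
    my ($bytes, $charset) = @_;

    # A charset was declared and Encode knows it: decode with it.
    if (defined $charset && Encode::find_encoding($charset)) {
        return Encode::decode($charset, $bytes);
    }

    # No usable charset. If Encode::ZapCP1252 is installed, this is
    # where Encode::ZapCP1252::fix_cp1252($bytes) would be called.

    # Assumption: treat the body as UTF-8. With the default CHECK value,
    # decode() substitutes U+FFFD for each malformed sequence.
    my $text = Encode::decode('UTF-8', $bytes);

    # Drop the substitution characters. Caveat: this also removes any
    # U+FFFD that was legitimately present in the input.
    $text =~ s/\x{FFFD}//g;

    return $text;
}
```

Something like body_to_text($raw_body, $charset_from_header) then gives
the parser a character string in both cases. Whether silently dropping
bytes is acceptable depends on what the form data is used for.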

Best,

David

