[Pdx-pm] unicode vs western text encoding
Amy K. Farrell
akf at aracnet.com
Fri Jul 4 00:10:48 PDT 2008
On Thu, Jul 03, 2008 at 01:30:40PM -0700, Thomas Keller wrote:
> We get some emails that we need to parse. They come from a web form,
> so I don't know why we are receiving some with unicode characters and
> others as simple western text encoding.
This is a bit of a shot in the dark, but ...
Are you saying that the user submits a web form, and the web server
then sends the email? If so, web browsers should (should!) use the
same character encoding specified by the form's Content-type header
when submitting the data.
For example, if the Content-type of the form is "text/html;
charset=UTF-8", text should be submitted in the UTF-8 encoding. ASCII
is a subset of UTF-8, so it could be that all your messages are really
UTF-8 and the "simple western" ones just happen to contain only ASCII
characters. You should be able to decode as David suggests, specifying
that encoding for all the messages.
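To illustrate the idea, here is a rough sketch in Python (not Perl, and
not necessarily what David had in mind). The function name
decode_form_text and the cp1252 fallback are my own assumptions; the
fallback only covers the case where a browser ignored the declared
charset and sent Windows "western" bytes instead.

```python
def decode_form_text(raw: bytes, declared: str = "utf-8",
                     fallback: str = "cp1252") -> str:
    """Decode submitted bytes using the charset the form declared.

    Hypothetical helper: 'declared' is whatever charset the form's
    Content-type names; 'fallback' is an assumed legacy encoding.
    """
    try:
        # Pure-ASCII input decodes identically here, since ASCII is a
        # subset of UTF-8.
        return raw.decode(declared)
    except UnicodeDecodeError:
        # The bytes are not valid in the declared encoding; assume the
        # legacy encoding and replace anything it cannot map.
        return raw.decode(fallback, errors="replace")


ascii_only = decode_form_text(b"plain ASCII")
utf8_text = decode_form_text("caf\u00e9".encode("utf-8"))
legacy_text = decode_form_text(b"caf\xe9")
```

All three calls return ordinary text strings; only the last one has to
fall back, because a lone 0xE9 byte is not valid UTF-8.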
If no charset is specified for the web form at all, that may be why
you're getting unexpected results.
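If you want to check whether a given Content-type value names a charset
at all, something like the following Python sketch would do it. The
helper name declared_charset is made up for illustration; it leans on
the standard library's email.message.Message to parse the header value.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


with_charset = declared_charset("text/html; charset=UTF-8")
without_charset = declared_charset("text/html")
```

A None result is the "no charset specified" case, where the browser is
left to pick an encoding on its own.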