[San-Diego-pm] Cold shower in UTF-8

Tim Bollman tim.bollman at gmail.com
Sat Oct 26 11:57:11 PDT 2013


On Sat, Oct 26, 2013 at 10:43 AM, Brian Manning <elspicyjack at gmail.com> wrote:
> On Sat, Oct 26, 2013 at 6:02 AM, Joel Fentin <joel at fentin.com> wrote:
>> Either you don't understand my problem or I don't understand you or both.
>> But I appreciate your and Russ's efforts.
>
> It must be me.
>
>> Before the MySQL conversion, the operator would type the following into a
>> text area:
>>
>> line1 + [enter key] + line2 + [enter key] + line3
>>
>> When they were done, they would click an OK button.
>> I ran what they typed thru the following code before putting it into the
>> database:
>> $Value =~ s/\15//g; #snuff chr 13 (may screw up db file)
>> $Value =~ s/\n/¶/g; #convert chr 10 to ¶
>>
>> In this case I arbitrarily chose ¶ to represent LF.
>
> Which, stored as the single Latin-1 byte 0xB6, is not valid UTF-8.
>
>> To later access this for display on a webpage, I took what was in the
>> database and ran it through this:
>> $Value =~ s/¶/<br>/g;
>>
>> The displayed result looked like this:
>> line1
>> line2
>> line3
>>
>> ======================
>>
>> If I attempt this now, I can do the same thing, but would have to replace
>> the display code (above) with:
>> $Value =~ s/¶/<br \/>/g;
>>
>> This because ¶ is greater than chr 127.
>>
>> Rather than roll my own, I'd rather go with a standard. I confess, when I go
>> to http://en.wikipedia.org/wiki/UTF-8
>> I don't quite grasp the Description nor the codepage layout. They give an
>> example of €. I can't follow it. Worse, I don't know how much I need to know
>> and how much I don't.
>
> Can you use a different separator, such as the pipe character '|'
> (decimal 124/0x7c), or ASCII NUL (0x0), both of which are valid
> UTF-8?  Any character up to and including 0x7f (127 decimal) in the
> ASCII table is also valid UTF-8.  It sounds like that's all you want
> to deal with at the moment.
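
For what it's worth, ¶ itself is a perfectly legal character (U+00B6);
what isn't valid UTF-8 is the lone byte 0xB6 you get from a Latin-1
source file or form. A minimal sketch to see the difference, assuming
the core Encode module is available (the variable names are just for
illustration):

```perl
use strict;
use warnings;
use Encode ();

# U+00B6 PILCROW SIGN as a single character.
my $pilcrow = "\x{B6}";

# Its UTF-8 encoding is two bytes, not one.
my $bytes = Encode::encode('UTF-8', $pilcrow);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "%d bytes: %s\n", length($bytes), $hex;   # 2 bytes: C2 B6

# A lone 0xB6 byte (the Latin-1 form) does not decode as UTF-8.
my $lone = "\xB6";
my $ok   = eval { Encode::decode('UTF-8', $lone, Encode::FB_CROAK); 1 };
print $ok ? "lone 0xB6 decoded\n" : "lone 0xB6 is not valid UTF-8\n";
```

That is also why anything at or below 0x7f is painless: those
characters are one byte each, and the same byte in ASCII, Latin-1 and
UTF-8.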

I'd recommend staying away from ASCII NUL as much as you can. Use 0x1F
(unit separator) or something similar instead. It's equally unused in
real text, but plays well with C, where NUL terminates strings.  I
suppose it hurts compatibility with COBOL (and I think some Fortran I/O
libraries actually use all the separators too), but I don't see that as
a bad thing.
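
A rough sketch of what that would look like with Joel's two
substitutions, swapping ¶ for 0x1F. The sub names encode_for_db and
decode_for_html are made up for the example, and I'm assuming the value
only ever holds text the operator typed:

```perl
use strict;
use warnings;

# Prepare textarea input for storage (hypothetical helper name).
sub encode_for_db {
    my ($value) = @_;
    $value =~ s/\015//g;      # snuff chr 13, as in the original code
    $value =~ s/\n/\x1F/g;    # convert chr 10 to the unit separator
    return $value;
}

# Turn the stored value back into HTML line breaks (hypothetical name).
sub decode_for_html {
    my ($value) = @_;
    $value =~ s/\x1F/<br>/g;
    return $value;
}

my $stored = encode_for_db("line1\r\nline2\r\nline3");
print decode_for_html($stored), "\n";   # line1<br>line2<br>line3
```

Since 0x1F is below 0x80, it is the same single byte whether the
database column is Latin-1 or UTF-8, so the display regex doesn't have
to change when the encoding does.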

>
> Thanks,
>
> Brian
> _______________________________________________
> San-Diego-pm mailing list
> San-Diego-pm at pm.org
> http://mail.pm.org/mailman/listinfo/san-diego-pm

