[San-Diego-pm] Cold shower in UTF-8

Sat Oct 26 16:10:06 PDT 2013

On 10/26/2013 11:57 AM, Tim Bollman wrote:
> On Sat, Oct 26, 2013 at 10:43 AM, Brian Manning <elspicyjack at gmail.com> wrote:
>> On Sat, Oct 26, 2013 at 6:02 AM, Joel Fentin <joel at fentin.com> wrote:
>>> Either you don't understand my problem or I don't understand you or both.
>>> But I appreciate your and Russ's efforts.
>>
>> It must be me.
>>
>>> Before the MySQL conversion, the operator would type the following into a
>>> text area:
>>>
>>> line1 + [enter key] + line2 + [enter key] + line3
>>>
>>> When they were done, they would click an OK button.
>>> I ran what they typed thru the following code before putting it into the
>>> database:
>>> $Value =~ s/\15//g; #snuff chr 13 (may screw up db file)

I don't understand why you're doing this. How could a CR character
possibly "screw up" the db file? You're storing a string into a text
column. You ought to be able to incorporate anything you like in the string.

If, for some reason, you do encounter problems using a text column, try
using a blob.
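
To illustrate the point, here is a minimal sketch that stores the text exactly as typed and converts line endings only at display time. The helper name text_to_html and the sample string are my own inventions for illustration, and I'm assuming the browser submits the text area with CRLF line endings:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the stored text into HTML at display time.
sub text_to_html {
    my ($text) = @_;

    # Escape HTML-significant characters first so user input cannot inject markup.
    my %esc = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$esc{$1}/g;

    # CRLF, a lone CR, or a lone LF each become one <br>.
    $text =~ s/\r\n|\r|\n/<br>/g;
    return $text;
}

# What the operator typed, kept unchanged in the database column.
my $stored = "line1\r\nline2\r\nline3";

print text_to_html($stored), "\n";   # line1<br>line2<br>line3
```

With this approach there is no marker character to choose at all, so the question of which bytes survive the round trip through the database never comes up for the line breaks.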

>>> $Value =~ s/\n/¶/g; #convert chr 10 to ¶
>>>
>>> In this case I arbitrarily chose ¶ to represent LF.
>>
>> Which is not a legal UTF-8 character.
>>
>>> To later access this for display on a webpage, I took what was in the
>>> database and ran it through this:
>>> $Value =~ s/¶/<br>/g;
>>>
>>> The displayed result looked like this:
>>> line1
>>> line2
>>> line3
>>>
>>> ======================
>>>
>>> If I attempt this now, I can do the same thing, but would have to replace
>>> the display code (above) with:
>>> $Value =~ s/Â¶/<br \/>/g;
>>>
>>> This is because ¶ is greater than chr 127.
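
For what it's worth, ¶ is U+00B6, a perfectly valid character, and its UTF-8 encoding is the two bytes 0xC2 0xB6. The "Â¶" above is what those two bytes look like when something along the way reads them as Latin-1. A small sketch of that, using the core Encode module (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00B6 PILCROW SIGN, written as an escape so the encoding of this
# source file does not matter.
my $pilcrow = "\x{B6}";

# Encoded as UTF-8, the single character becomes two bytes.
my $bytes = encode('UTF-8', $pilcrow);
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# Misreading those two bytes as Latin-1 yields two characters:
# U+00C2 (Â) followed by U+00B6 (¶).
my $misread    = decode('ISO-8859-1', $bytes);
my $codepoints = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;

print "$hex\n";          # C2 B6
print "$codepoints\n";   # U+00C2 U+00B6
```

So the substitution that matches "Â¶" is a symptom of the script, the connection, and the database disagreeing about the encoding, not of ¶ being illegal.
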
>>>
>>> Rather than roll my own, I'd rather go with a standard. I confess, when I go
>>> to http://en.wikipedia.org/wiki/UTF-8
>>> I don't quite grasp the Description nor the codepage layout. They give an
>>> example of €. I can't follow it. Worse, I don't know how much I need to know
>>> and how much I don't.
>>
>> Can you use a different separator, such as the pipe character '|'
>> (decimal 124/0x7c), or use ASCII NUL (0x0), both of which are valid
>> UTF-8?  Any character below 0x7f or 127 decimal inclusive in the ASCII
>> table is also valid UTF-8.  It sounds like that's all you want to deal
>> with at the moment.
> 
> I'd recommend staying away from ASCII NUL as much as you can. Use 0x1F
> (unit separator) or something instead. Equally unused in real text,
> but plays well with C.  I suppose it hurts compatibility with COBOL
> (and I think some Fortran IO libraries actually use all the separators
> too), but I don't see that as a bad thing.
> 
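
If a marker character is wanted anyway, Tim's suggestion could look something like the sketch below. The helper names pack_lines and unpack_html are hypothetical. Since 0x1F is below 0x80, it is a single byte in both ASCII and UTF-8, so the encoding mismatch does not affect it:

```perl
use strict;
use warnings;

my $US = "\x1F";   # ASCII unit separator

# Hypothetical helper: replace each line ending with the unit separator
# before the text goes into the database.
sub pack_lines {
    my ($text) = @_;
    $text =~ s/\r\n|\r|\n/$US/g;
    return $text;
}

# Hypothetical helper: turn the separators back into <br> for display.
sub unpack_html {
    my ($text) = @_;
    $text =~ s/\x1F/<br>/g;
    return $text;
}

my $typed  = "line1\r\nline2\r\nline3";
my $stored = pack_lines($typed);

print unpack_html($stored), "\n";   # line1<br>line2<br>line3
```
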
>>
>> Thanks,
>>
>> Brian
>> _______________________________________________
>> San-Diego-pm mailing list
>> San-Diego-pm at pm.org
>> http://mail.pm.org/mailman/listinfo/san-diego-pm