APM: question about unicode encoding
jacob walcik
jwalcik at notwithstanding.org
Mon Jan 5 15:01:27 CST 2004
i apologize for my first post to the list being such a lengthy request
for assistance, but i'm spinning my wheels here and could really
use some guidance.
i've got a pair of files, coming from different sources, which have
special characters in them. in one file, they're represented by their
html equivalents ("&#" followed by a three-digit string), and in the
other, they're written as a three-digit code escaped by a backslash.
i need to take both of these files, and convert the text in them to
unicode, with the special characters properly represented, and then
insert the resulting strings into tables in a postgres database.
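
For illustration, here is a minimal sketch of how the two representations relate. It assumes the HTML form is a decimal numeric character reference (with an optional trailing ";") and that the backslash form is octal, which is consistent with \374 standing for ü. The sample values and variable names are hypothetical and are not taken from the actual files.

```perl
use strict;
use warnings;

# Hypothetical sample values, not taken from the original files.
my $html_form  = '&#252;';   # decimal numeric character reference
my $octal_form = '\374';     # single-quoted: a literal backslash plus "374"

# Pull the digits out of each form.
my ($dec) = $html_form  =~ /&#(\d+);?/;
my ($oct) = $octal_form =~ /\\([0-7]{3})/;

# Turn the digits into characters.
my $from_html  = chr($dec);        # 252 read as decimal
my $from_octal = chr(oct($oct));   # 374 read as octal, which is 252 decimal

# Both forms name the same code point.
printf "U+%04X U+%04X\n", ord($from_html), ord($from_octal);   # prints: U+00FC U+00FC
```

If that assumption holds, both files can be handled by the same step: extract the digits, convert them to a number, and call chr on it.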
initially, i wrote a pair of scripts, one to deal with each type of
file, that just read the file in and through a series of regular
expressions converted the characters, like so:
#!/usr/bin/perl
use Unicode::String qw(utf8);

#open the incoming and outgoing sql files
open(INFILE, "<:utf8", "german_1.sql");
open(OUTFILE, ">:utf8", "german_unicode_2.sql");

while (<INFILE>) {
    $line = utf8($_);
    $line =~ s#\374#ü#g;
    ...
    $line = utf8($line);
    print OUTFILE $line;
}

close(INFILE);
close(OUTFILE);
in the output, however, all of the encoded characters are missing and
haven't been replaced with anything. i've tried adding a "use utf8;"
at the beginning, but that doesn't appear to have had any effect. is
there another module i need in order to add unicode support to regular
expressions? i've found Unicode::Regex::Set, but that just appears to
deal with addition and subtraction of characters, not with
substitutions.
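
One detail worth checking in a substitution like the one above: inside a Perl pattern, \374 is an octal escape that matches the single character with code 0xFC, not a backslash followed by the digits 3, 7, 4. Matching the literal text needs an escaped backslash (\\). The following is a rough sketch using only core regular expressions and the /e modifier, with no extra module. The helper name decode_escapes and the sample line are made up for illustration, and whether this accounts for the missing characters in the real output is not something the sketch can confirm.

```perl
use strict;
use warnings;

# Hypothetical helper: turn both escape styles into real characters.
sub decode_escapes {
    my ($text) = @_;

    # "&#252;" style: digits are a decimal code point; the ";" is assumed optional.
    $text =~ s/&#(\d+);?/chr($1)/ge;

    # "\374" style: "\\" matches a literal backslash; the digits are read as octal.
    $text =~ s/\\([0-3][0-7]{2})/chr(oct($1))/ge;

    return $text;
}

# Encode as UTF-8 on output; the sample line is invented for this example.
binmode(STDOUT, ':encoding(UTF-8)');
print decode_escapes('M&#252;nchen, Z\374rich'), "\n";   # prints: München, Zürich
```

The same function could be applied to each line read from the input file, with an ":encoding(UTF-8)" layer on the output handle, so that no character has to be typed literally into the script.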
any advice or suggestions would be much appreciated. thanks.
--
jacob walcik
jwalcik at notwithstanding.org