APM: question about unicode encoding

jacob walcik jwalcik at notwithstanding.org
Mon Jan 5 15:01:27 CST 2004


i apologize for my first post to the list being such a lengthy request 
for assistance, however i'm spinning my wheels here and could really 
use some guidance.

i've got a pair of files, coming from different sources, which have 
special characters in them.  in one file, they're represented by their 
html numeric entities ("&#" followed by a three-digit string) and in 
the other, they're written as a three-digit code escaped by a backslash.  
i need to take both of these files, and convert the text in them to 
unicode, with the special characters properly represented, and then 
insert the resulting strings into tables in a postgres database.

initially, i wrote a pair of scripts, one to deal with each type of 
file, that just read the file in and through a series of regular 
expressions converted the characters, like so:
#!/usr/bin/perl

use Unicode::String qw(utf8);

#open the incoming and outgoing sql files
open(INFILE,"<:utf8","german_1.sql");
open(OUTFILE,">:utf8","german_unicode_2.sql");

while (<INFILE>) {
	$line = utf8($_);
	
	$line =~ s#\374#ü#g;
	...
	
	$line = utf8($line);
	
	print OUTFILE $line;
	
}

close(INFILE);
close(OUTFILE);

in the output, however, all of the encoded characters are missing and 
haven't been replaced with anything.  i've tried adding a "use utf8;" 
at the beginning, but that doesn't appear to have had any effect.  is 
there another module i need in order to add unicode support to regular 
expressions?  i've found Unicode::Regex::Set, but that just appears to 
deal with addition and subtraction of characters, not with 
substitutions.
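
for reference, here is a minimal sketch of the kind of substitution 
described above, using only core perl.  it rests on assumptions that 
the description does not confirm: that the backslash codes are octal 
(so \374 would be U+00FC, "ü"), that the html codes are decimal (so 
&#252; is the same character), and that the backslash is literally 
present in the file.  note that inside a perl regex, \374 on its own 
is an octal escape for a single character, so matching a literal 
backslash takes "\\".  decode_escapes is a hypothetical helper name, 
not something from the scripts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# decode_escapes: hypothetical helper, not part of the original scripts.
# ASSUMPTION: "\374" in the data is a literal backslash plus three OCTAL digits.
# ASSUMPTION: "&#252;" in the data is an HTML numeric entity in DECIMAL,
#             with an optional trailing semicolon.
sub decode_escapes {
    my ($text) = @_;

    # literal backslash + three octal digits -> character with that code point
    $text =~ s/\\([0-3][0-7]{2})/chr(oct($1))/ge;

    # "&#" + three decimal digits (+ optional ";") -> character with that code point
    $text =~ s/&#(\d{3});?/chr($1)/ge;

    return $text;
}

# encode as UTF-8 on the way out, so the decoded characters are written correctly
binmode(STDOUT, ":encoding(UTF-8)");
print decode_escapes('gr\374n / gr&#252;n'), "\n";
```

in a script like the one above, the same two substitutions could be 
applied to each line inside the while loop, with the output handle 
opened using an encoding layer.  whether the input handle should use 
":utf8" depends on how the source files are actually encoded, which 
isn't known here.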

any advice or suggestions would be much appreciated.  thanks.

--
jacob walcik
jwalcik at notwithstanding.org


