[oak perl] [sf-perl] Internationalizing a Perl application

Mon Jan 15 18:02:07 PST 2007

On Mon, Jan 15, 2007 at 12:14:31PM -0800, David Fetter wrote:
> Kind people,
> 
> With DBI-Link's growing popularity world-wide, I'm getting more
> requests for translation of the docs and strings.  This presents a
> huge bunch of opportunities for me to make mistakes, so I'm asking for
> the collective wisdom out there.

I have many individual clues.  Hopefully they add up to something useful.

First of all, you really have two tasks: I) translate the documentation, which
basically just requires some human translators, and II) translate the strings
in the program, which requires some human translators plus retrofitting
DBI-Link with internationalization and localization (I18N / L10N) code.  Don't
worry; it's not painful.

Now, some definitions.  These all apply to task II.  I'm giving them because
you asked about the difference between locale and encoding; forgive me if I'm
repeating the obvious.

A.  locale:  the "culture" in which your app is running--for instance Spanish,
American English, or British English.  Locales determine language, and
sometimes things like number formatting (think 3,14 versus 3.14).  Probably
you just have to worry about language.

There's an RFC (part of BCP 47) that defines the familiar codes for locales,
like es, en-US, and en-GB: a two-letter language code, optionally followed by
a country code.  See http://en.wikipedia.org/wiki/Locale .

Your program finds out what locale it's running in by querying the operating
environment somehow.  You'll want to use some library like GNU gettext that
will take care of these details for you.
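
To make "querying the operating environment" concrete, here is a minimal
sketch of the kind of lookup such a library does for you on a Unix-like
system.  It assumes the usual POSIX variables (LC_ALL, then LC_MESSAGES, then
LANG); the function name and the fallback to 'en' are my own inventions, and
real libraries handle more cases than this.

```perl
use strict;
use warnings;

# Pick a language tag from POSIX-style environment variables.
# Hypothetical helper: the name and the 'en' default are illustrative only.
sub language_from_env {
    my (%env) = @_;
    for my $var (qw(LC_ALL LC_MESSAGES LANG)) {
        my $value = $env{$var};
        next unless defined $value && length $value;
        # The "C" and "POSIX" locales mean "no localization".
        return 'en' if $value =~ /\A(?:C|POSIX)(?:\.|\z)/;
        # "es_MX.UTF-8" -> language "es", territory "MX"
        if ($value =~ /\A([A-Za-z]{2,3})(?:_([A-Za-z]{2}))?/) {
            return defined $2 ? lc($1) . '-' . uc($2) : lc($1);
        }
    }
    return 'en';
}

print language_from_env(%ENV), "\n";
```

Passing the environment in as an argument, instead of reading %ENV inside the
function, keeps the lookup easy to test.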

B.  internationalization (a.k.a. I18N): the process of (re)writing your app so
that it can support different locales.  In common practice (e.g. with
gettext), this means wrapping all your English strings with some function that
will return the English in an English locale, and the appropriate translation
in any other locale.  Of course, this requires a database of translation
strings, which brings us to...

C.  localization (a.k.a. L10N):  the process of adding translations.
First you internationalize it; then you add a database of translation
strings for Italian, a database for Chinese, et cetera.  There's a nice
separation of concerns, so that you just add a new file for each new
language; you don't have to change your actual code, and, in fact, you
can accept localization databases (or "modules") from non-coders,
which is just what you want to do.
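
Here is a toy sketch of B and C together, just to show the shape of the idea.
It is not how gettext stores its data; the loc() wrapper, set_language(), and
the sample strings are all made up for illustration.  In a real project each
language's table would live in its own file, written by a translator.

```perl
use strict;
use warnings;

# L10N: one table per language, keyed by the English source string.
my %LEXICON = (
    es => { 'Table not found' => 'Tabla no encontrada' },
    it => { 'Table not found' => 'Tabella non trovata' },
);

my $current_lang = 'en';
sub set_language { $current_lang = shift; return }

# I18N: every user-visible string in the program goes through this wrapper.
# It returns the translation if one exists, and the English text otherwise.
sub loc {
    my ($english) = @_;
    my $table = $LEXICON{$current_lang} || {};
    return exists $table->{$english} ? $table->{$english} : $english;
}

set_language('es');
print loc('Table not found'), "\n";
set_language('en');
```

Adding Chinese means adding one more table; the calls to loc() in the program
don't change.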

So much for I18N and L10N.  Now on to encoding, a separate issue:

D.  encoding:  how you represent a language on the level of bits and bytes.
Familiar encodings are ASCII, UTF-8 (one of several Unicode encodings),
and ISO-Latin-1.  You don't need to know much about these, because you
want to use UTF-8, a Unicode encoding.  Trust me.

E.  Unicode: a standardized character set that encompasses all the written
languages of the world (well, all the ones that someone's bothered defining
standards for).  There's space for many more.  People are actively reserving
that space and adding new scripts, like Tibetan.

Unicode has become _the_ standard character set for written language.  It's
supported much more widely than anything else.

F.  UTF-8:  the most popular way of encoding Unicode.  One big advantage of
UTF-8 is that it's a proper superset of ASCII.  That means all ASCII strings
are already UTF-8.  (Characters outside ASCII, such as Chinese ones, are
encoded as sequences of two to four bytes.  There is no "escape" character;
the high bits of the first byte say how many bytes follow.)
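
If you want to see this for yourself, here is a small sketch that counts how
many bytes a character takes once it is encoded as UTF-8.  It uses Perl's
built-in utf8::encode; the helper name is made up.  The non-ASCII characters
are written as \x{...} escapes so the example doesn't depend on how this
message itself is encoded.

```perl
use strict;
use warnings;

# Return the number of bytes a string occupies once encoded as UTF-8.
sub utf8_byte_length {
    my ($text) = @_;
    my $bytes = $text;      # copy, because utf8::encode works in place
    utf8::encode($bytes);
    return length $bytes;
}

# 'A' (ASCII), U+00E9 (e with acute accent), U+4E2D (a Chinese character)
printf "%d %d %d\n",
    map { utf8_byte_length($_) } 'A', "\x{E9}", "\x{4E2D}";
# prints "1 2 3"
```

For real work the Encode module offers the same conversion for UTF-8 and many
other encodings.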

Encoding summary:  use UTF-8 and don't worry about it. :)  (You may need
to alter your regexes slightly, but this is true for any encoding
besides ASCII.  I'll explain further later on.)

Now, with these definitions out of the way, I'll take a stab at your other
questions:

> * Where do I start?

1.  Probably the people who are requesting translations are the very people
most able to provide them.  They're contacting you in English to make the
request, right?  And they're native speakers of the target language?  They're
certainly motivated, and they probably have contacts among other DBI-Link
users (e.g. at work) who are also native speakers.

In other words, when people ask "Why aren't there docs in Portuguese?" you
say, "Because you haven't written them yet." ;)

1a.  If they're willing, send them the English localization module file,
and have them translate it into their native language (written in UTF-8,
of course).  This is easy; it's just a bunch of strings.

1b.  It may be harder to get them to commit to translating the documentation,
but if they're willing, great!

1c.  If they can't do it, ask them if they know someone who can.

2.  There have been attempts to mount organized translation projects
for (e.g.) the Linux HOWTOs.  I don't know how successful these
have been.  You should try searching for "translation project"
or "Linux translation project", or the like.

> * What are some good techniques for dividing the work?

3.  From a linguistic perspective, give translators full control.
Your role is to accept and present/distribute what they produce.

3a.  Temper this with instant peer review.  Use a wiki or some other
user-editable format so other native speakers can improve and clarify.

4.  From a technical perspective, GNU gettext is popular among C
programmers.  As I wrote earlier, it lets you write and maintain
a separate localization module for each language, without mucking
with the code.  This is a good model.  If you use it,
you're in pretty good shape.

4a.  There is a Gettext module on CPAN that interfaces to GNU gettext.
As a side note, it's written by James Briggs, who used to run Silicon
Valley Perl Mongers (dormant since August).

4b.  There may be something even better.  Perl Monks is a good place to ask
about what is really useful.
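
One candidate to ask about (I'm not claiming it's the better choice) is
Locale::Maketext, which ships with Perl 5.8 and later.  It follows the same
model: one lexicon per language, and the code only ever sees the English key.
The sketch below crams everything into one file to stay short; the package
names DBILinkDemo::L10N::* and the sample string are made up, and normally
each language class would be its own .pm file.

```perl
use strict;
use warnings;
use Locale::Maketext;

# Hypothetical project base class.
{
    package DBILinkDemo::L10N;
    our @ISA = ('Locale::Maketext');
}
# English: _AUTO makes every key translate to itself.
{
    package DBILinkDemo::L10N::en;
    our @ISA = ('DBILinkDemo::L10N');
    our %Lexicon = ( _AUTO => 1 );
}
# Spanish: this is the part a translator would supply.
{
    package DBILinkDemo::L10N::es;
    our @ISA = ('DBILinkDemo::L10N');
    our %Lexicon = ( 'Connected to [_1]' => 'Conectado a [_1]' );
}

# Look up a message for a language tag; [_1] is filled from @args.
sub translate {
    my ($lang, $key, @args) = @_;
    my $handle = DBILinkDemo::L10N->get_handle($lang)
        or die "no language handle for $lang";
    return $handle->maketext($key, @args);
}

print translate('es', 'Connected to [_1]', 'remote_db'), "\n";
# prints "Conectado a remote_db"
```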

> * What am I going to stub my toes on no matter what I do?

5.  To get proper Unicode support, you must use Perl 5.8 or higher.
However, I think DBI-Link already has this requirement--right?
So maybe it's not an issue.

> * What pains can I avoid, and how?

6.  Use UTF-8!  Picking an encoding used to be a big issue, but UTF-8's
wide adoption solves that problem.

6a.  That said, you'll have to learn how to handle Unicode properly in your
Perl code, including your regexes.

6a(i).  Advanced Perl Programming (Second Edition only) has a chapter
on Unicode.  I see this by looking up the Table of Contents in Safari.
I haven't read that chapter, but you should check it out.

6a(ii).  Perl Best Practices p. 248 says you should use Unicode character
classes within regexes, using the \p{} syntax (because character classes
like [A-Za-z] won't match letters outside ASCII).  See perlunicode(1) (run
"perldoc perlunicode") for details.
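
A quick sketch of the difference, using \p{L} (the Unicode "letter"
property).  The two helper names are made up, and the test word is spelled
with \x{...} escapes so the example doesn't depend on the encoding of this
message.

```perl
use strict;
use warnings;

# ASCII-only idea of a "word": misses accented and non-Latin letters.
sub is_word_ascii   { return $_[0] =~ /\A[A-Za-z]+\z/ ? 1 : 0 }

# Unicode-aware version: \p{L} matches a letter in any script.
sub is_word_unicode { return $_[0] =~ /\A\p{L}+\z/ ? 1 : 0 }

# "Dvorak" with r-caron (U+0159) and a-acute (U+00E1)
my $name = "Dvo\x{159}\x{E1}k";

printf "ascii class: %d, unicode class: %d\n",
    is_word_ascii($name), is_word_unicode($name);
# prints "ascii class: 0, unicode class: 1"
```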

7.  Wherever possible, recruit native speakers of your target languages.
If you use non-native translators, no matter how fluent they are, they
will make mistakes.  These can range from amusing (think of all
those badly translated Japanese video games) to downright confusing.

7a.  On the other hand, bad translations are better than no translations.
If you can't find a native speaker, and someone else is volunteering,
let them go for it.  Again, always use a wiki (or similar) format so people
can make corrections.

OK, I hope all that helps.  I know it's somewhat desultory.  You can see there
are big gaps in what I know, but I hope you can pick up on some of these
leads.  If they prove helpful, consider including me in the DBI-Link credits.

Good luck,

-- 
Quinn Weaver DBA Fairpath                http://fairpath.com/quinn/contact/
President, San Francisco Perl Mongers    http://sf.pm.org/