[oak perl] Review of "Fonts & Encodings"

Eli the Bearded oaklandpm at eli.users.panix.com
Tue Jan 22 17:17:36 PST 2008


George writes:
> if convenient,
> i request you explain the difference
> between character set and character encoding
> [or alternatively provide a url
> of an explanation you endorse]?

I provided a very brief example in my post. A "character set", as
the term is typically used, is an ordered set of characters. A
"character encoding" is a bit-wise representation of those characters.
The Unicode character set is pretty good about having all characters,
but there are many ways a document using it can be encoded. Often
the term "charset" is used to mean "character set and encoding", due
to early standards documents where the authors didn't appreciate the
difference.
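
To make the distinction concrete, here is a small sketch in Python (my
choice for illustration, nothing in the original discussion depends on
it). It takes one character, looks up its position in the Unicode
character set, and then shows three different encodings of that same
position. The variable names are made up for the example.

```python
# One character from the Unicode character set.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Its position (code point) in the character set does not depend on
# any encoding.
code_point = ord(ch)

# The same code point represented as octets under three encodings.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"U+{code_point:04X} as {name}: {data.hex()}")
```

The character set answers "which character is number 0xE9?"; the
encoding answers "which octets do I write to the file for it?".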

UTF-8 is a very popular method of encoding Unicode. In UTF-8, all of
US-ASCII is represented in eight-bit wide characters, and other
characters are multiples of eight-bit units. All characters not in
US-ASCII will have the high bit set in every octet in UTF-8.
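
A quick way to check the high-bit claim, sketched in Python with a
hypothetical helper named high_bits (not from any library):

```python
def high_bits(data):
    """Return, for each octet, whether its high bit (0x80) is set."""
    return [bool(octet & 0x80) for octet in data]

# A US-ASCII character is a single octet with the high bit unset.
ascii_bytes = "A".encode("utf-8")

# The euro sign is outside US-ASCII and takes three octets in UTF-8.
euro_bytes = "\u20ac".encode("utf-8")

print(ascii_bytes.hex(), high_bits(ascii_bytes))
print(euro_bytes.hex(), high_bits(euro_bytes))
```

Every octet of the euro sign reports True, while the lone octet of "A"
reports False.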

UTF-7 is a rare method of encoding Unicode. All octets have the
high bit unset in UTF-7. US-ASCII has characters which need to be
escaped as multi-octet sequences in UTF-7.
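
Python happens to ship a "utf-7" codec, so both properties can be
sketched there. The helper all_high_bits_unset is a name I made up for
the example.

```python
def all_high_bits_unset(data):
    """True when every octet is below 0x80."""
    return all(octet < 0x80 for octet in data)

# Most of US-ASCII passes through unchanged.
plain = "hello".encode("utf-7")

# "+" starts a shifted sequence in UTF-7, so a literal plus sign has to
# be escaped as the two-octet sequence "+-".
plus = "+".encode("utf-7")

# A non-ASCII character is written as a shifted sequence that still
# only uses seven-bit octets.
accented = "caf\u00e9".encode("utf-7")

print(plain, plus, accented, all_high_bits_unset(accented))
```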

UTF-16 is a common method of encoding Unicode, but due to byte order
differences between computers, it comes in two different varieties:
big-endian and little-endian. All characters in UTF-16 are multiples
of sixteen bits wide. I've run into issues where properly formatted
UTF-16 has not been recognized due to the lack of a byte-order-mark
(BOM, a control character useful to distinguish the two flavors of
UTF-16). iconv does not include a BOM when converting to UTF-16.
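
The byte order problem can be sketched in a few lines of Python. This
only shows how Python's own codecs behave; other tools (iconv included)
make their own choices about when to write a BOM.

```python
import codecs

text = "A"
big = text.encode("utf-16-be")     # octets 00 41
little = text.encode("utf-16-le")  # octets 41 00

# The same two octets read with the wrong byte order decode to a
# completely different character.
misread = big.decode("utf-16-le")

# With a BOM in front, a generic "utf-16" decoder can work out the
# byte order for itself.
with_bom = codecs.BOM_UTF16_BE + big
recovered = with_bom.decode("utf-16")

# Characters outside the Basic Multilingual Plane take two sixteen-bit
# units (a surrogate pair), so four octets in total.
wide = "\U0001F600".encode("utf-16-be")

print(big.hex(), little.hex(), hex(ord(misread)), recovered, len(wide))
```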

UTF-32 is another method of encoding Unicode. It also comes in
big-endian and little-endian varieties, and as you might have guessed,
uses characters that are thirty-two bits wide.
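
A short Python sketch of the fixed width, using a handful of sample
characters I picked for the example:

```python
# Characters from US-ASCII, Latin-1, the Basic Multilingual Plane, and
# beyond it.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of octets each one takes in big-endian UTF-32.
widths = {ch: len(ch.encode("utf-32-be")) for ch in samples}

# The two byte orders are mirror images of each other.
big = "A".encode("utf-32-be")
little = "A".encode("utf-32-le")

print(widths)
print(big.hex(), little.hex())
```

Every sample comes out at four octets, which is the trade-off: simple
indexing, at the cost of space for mostly-ASCII text.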

Almost all authors talking about character encoding write in terms
of octets. Apparently they don't remember PDP-10 systems and the
like which offered non-octet based encodings.

Alan Flavell (RIP) wrote eloquently about the issue, at least
in the context of the WWW, but he (obviously) isn't updating it
anymore, and the only place to find his pages is on archive.org:

http://web.archive.org/web/20051214075302/ppewww.ph.gla.ac.uk/~flavell/charset/internat.html

This page covers the concept quite well, and is not focused on 
implications for the web:

http://www.cs.tut.fi/~jkorpela/chars.html

Once you throw fonts in, which change the look of the glyphs, the
whole thing starts to reek of semiotics. If you have read your
Saussure (_Course in General Linguistics_), you'll have no problems
following along. Linguistics as it intersects with computers is
fun stuff.

Elijah
------
don't forget what Larry Wall studied in college

