[oak perl] Review of "Fonts & Encodings"

Tue Jan 22 18:32:24 PST 2008

hi eli,
thanks for doing as i requested
[and more].
skoal, 
george

--------------------------------------------------

On Tuesday 22 January 2008 17:17, Eli the Bearded wrote:
> George writes:
> > if convenient,
> > i request you explain the difference
> > between character set and character encoding
> > [or alternatively provide a url
> > of an explanation you endorse]?
>
> I provided a very brief example in my post. A "character set", as
> the term is typically used, is an ordered set of characters. A
> "character encoding" is a bit-wise representation of those characters.
> The Unicode character set is pretty good about having all characters,
> but there are many ways a document using it can be encoded. Often
> the term "charset" is used to mean "character set and encoding", due
> to early standards documents where the authors didn't appreciate the
> difference.
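
A minimal sketch in Python (not part of the original message) of the distinction drawn above: one character has one position in the Unicode character set, but several possible octet representations. The helper name `encodings_of` and the sample character are illustrative assumptions.

```python
# Illustrative only: one character, one code point, several encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Position of the character in the Unicode character set.
code_point = ord(ch)


def encodings_of(text):
    """Return the octets produced by a few encodings of the same text."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: text.encode(name) for name in names}


for name, octets in encodings_of(ch).items():
    # Same code point every time; only the octets differ.
    print(f"U+{code_point:04X} as {name}: {octets.hex()}")
```

The code point stays the same in every row; only the encoded octets change.
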
>
> UTF-8 is a very popular method of encoding Unicode. In UTF-8, all of
> US-ASCII is represented as single eight-bit units, and other
> characters take two or more eight-bit units. All characters not in
> US-ASCII will have the high bit set in every octet in UTF-8.
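
The high-bit property can be checked with a short Python sketch. The function name `high_bits` and the sample string are assumptions made for illustration, not anything from the original post.

```python
def high_bits(text):
    """For each character, report whether each of its UTF-8 octets has the high bit set."""
    return [[bool(octet & 0x80) for octet in ch.encode("utf-8")] for ch in text]


sample = "a\u00e9\u20ac"  # 'a', e with acute, euro sign
for ch, flags in zip(sample, high_bits(sample)):
    # US-ASCII yields one octet with the high bit clear;
    # everything else yields several octets, all with it set.
    print(f"U+{ord(ch):04X}: {len(flags)} octet(s), high bits {flags}")
```
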
>
> UTF-7 is a rare method of encoding Unicode. All octets have the
> high bit unset in UTF-7. US-ASCII has characters which need to be
> escaped as multi-octet sequences in UTF-7.
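
A rough Python sketch of both UTF-7 claims, using Python's built-in "utf-7" codec. The helper names `to_utf7` and `is_seven_bit` are hypothetical and exist only for this example.

```python
def to_utf7(text):
    """Encode text with Python's built-in UTF-7 codec."""
    return text.encode("utf-7")


def is_seven_bit(octets):
    """True if every octet has the high bit unset."""
    return all(octet < 0x80 for octet in octets)


for s in ["a+b", "caf\u00e9"]:
    encoded = to_utf7(s)
    # '+' is plain US-ASCII, yet it must be escaped because it
    # introduces the base64-style shifted sequences in UTF-7.
    print(repr(s), "->", encoded, "7-bit:", is_seven_bit(encoded))
```
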
>
> UTF-16 is a common method of encoding Unicode, but due to byte order
> differences between computers, it comes in two different varieties:
> big-endian and little-endian. All characters in UTF-16 are multiples
> of sixteen bits wide. I've run into issues where properly formatted
> UTF-16 has not been recognized due to the lack of a byte-order-mark
> (BOM, a control character useful to distinguish the two flavors of
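> UTF-16). In my experience iconv does not include a BOM when converting
> to UTF-16, though this may depend on the iconv implementation and the
> target encoding name.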
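
A minimal Python sketch of the BOM problem, assuming a reader that has to guess the byte order when no BOM is present. The function `decode_utf16` and its `default` parameter are hypothetical names for this example; the fallback to big-endian is an assumption, not a rule taken from the message.

```python
import codecs


def decode_utf16(octets, default="utf-16-be"):
    """Decode UTF-16 octets, trusting a BOM if present, else falling back to `default`."""
    if octets.startswith(codecs.BOM_UTF16_BE):
        return octets[2:].decode("utf-16-be")
    if octets.startswith(codecs.BOM_UTF16_LE):
        return octets[2:].decode("utf-16-le")
    # No BOM: the byte order cannot be known for certain, so this is a guess.
    return octets.decode(default)


text = "hi"
print(text.encode("utf-16-be"))  # big-endian, no BOM
print(text.encode("utf-16-le"))  # little-endian, no BOM
print(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))  # little-endian with BOM
```

Without the BOM, the two flavors are just the same octets in swapped order, which is why a reader can fail to recognize otherwise valid UTF-16.
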
>
> UTF-32 is another method of encoding Unicode. It also comes in
> big-endian and little-endian varieties, and as you might have guessed,
> uses characters that are each thirty-two bits wide.
>
> Almost all authors talking about character encoding write in terms
> of octets. Apparently they don't remember PDP-10 systems and the
> like which offered non-octet based encodings.
>
> Alan Flavell (RIP) wrote eloquently about the issue, at least
> in the context of the WWW, but he (obviously) isn't updating it
> anymore, and the only place to find his pages is on archive.org:
>
> http://web.archive.org/web/20051214075302/ppewww.ph.gla.ac.uk/~flavell/charset/internat.html
>
> This page covers the concept quite well, and is not focused on
> implications for the web:
>
> http://www.cs.tut.fi/~jkorpela/chars.html
>
> Once you throw fonts in, which change the look of the glyphs, the
> whole thing starts to reek of semiotics. If you have read your
> Saussure (_Course in General Linguistics_), you'll have no problems
> following along. Linguistics as it intersects with computers is
> fun stuff.
>
> Elijah
> ------
> don't forget what Larry Wall studied in college
> _______________________________________________
> Oakland mailing list
> Oakland at pm.org
> http://mail.pm.org/mailman/listinfo/oakland