UTF-16 vs UTF-32

Phillips, Addison addison at lab126.com
Mon May 16 20:43:38 PDT 2011


> > Personally, I think UTF16 is more prone to error than either UTF8 or
> > UTF32 -- in UTF32 there is a one-to-one correspondence
> 
> One-to-one correspondence between string code units and Unicode codepoints.
> 
> Unfortunately, "Unicode codepoint" is only a useful concept for some scripts...
> So you run into the same edge-case issues as UTF-16 does, but in somewhat
> fewer cases.
> 

Not exactly. What is true is that, regardless of the Unicode encoding, a glyph on the screen may be composed of multiple Unicode characters which, in turn, may each be encoded as multiple code units.

I generally present this as:

Glyph == single visual unit of text
Character == single logical unit of text
Code point == integer value assigned to a single character (logical unit of text); generally it is better to refer to the "Unicode scalar value" here.
Code unit == single logical encoding unit of memory in a Unicode encoding form (where unit == byte, word, etc.)

So in UTF-8, the 'byte' is the code unit, which is used to encode "code points" [Unicode scalar values] (1 to 4 code units per code point). In UTF-16, the 'word' (16-bit unit) is the code unit, which is used to encode code points (1 to 2 code units per code point).
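
To make the ratios concrete, here is a quick sketch in JavaScript (whose strings are sequences of UTF-16 code units), in a runtime that provides TextEncoder for producing UTF-8 code units (bytes):

  // Count code units per code point in UTF-8 vs. UTF-16.
  const samples = ["A", "\u00E9", "\u20AC", "\u{1D11E}"]; // U+0041, U+00E9, U+20AC, U+1D11E
  const encoder = new TextEncoder();
  for (const s of samples) {
    const cp = s.codePointAt(0).toString(16).toUpperCase();
    console.log("U+" + cp + ": " +
      encoder.encode(s).length + " UTF-8 code unit(s), " +
      s.length + " UTF-16 code unit(s)");
  }
  // Prints 1/1, 2/1, 3/1 and 4/2 code units respectively.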

A glyph or "grapheme cluster" may require multiple characters to encode, so no part of Unicode can assume 1 glyph == 1 character == 1 code point == (x) code units.

The values in the surrogate range D800-DFFF *are* valid Unicode code points, but *not* valid Unicode scalar values. Isolated surrogate code points (not part of a surrogate pair in the UTF-16 encoding) are never "well-formed", and surrogate code points are considered invalid in encoding forms other than UTF-16.
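
Both points are easy to see in a quick JavaScript sketch, assuming an engine that exposes Intl.Segmenter for grapheme segmentation:

  // "e" followed by a combining acute accent: one glyph (grapheme cluster),
  // two characters/code points, two UTF-16 code units.
  const cluster = "e\u0301";
  console.log([...cluster].length);  // 2 code points
  console.log(cluster.length);       // 2 UTF-16 code units
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  console.log([...seg.segment(cluster)].length); // 1 grapheme cluster

  // A lone surrogate is a code point but not a scalar value; it has no
  // well-formed encoding, so TextEncoder substitutes U+FFFD for it.
  console.log(new TextEncoder().encode("\uD800")); // Uint8Array [239, 191, 189]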

As a result, "Unicode code point" is a marginally useful concept when talking about specific character values in Unicode. But a code point is not a code unit and must not be confused with one, and it is slightly less precise than referring to a Unicode scalar value. When talking about a supplementary character, for example, it is generally more useful to talk about the Unicode code point that the surrogate pair of UTF-16 code units encodes than about the Unicode code point of each surrogate.
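
For instance, U+1F600 is normally discussed as the single code point U+1F600 rather than as the surrogate code points U+D83D and U+DE00 that encode it in UTF-16. The decoding arithmetic is the standard one; a sketch in JavaScript:

  // Combine a high and a low surrogate into the supplementary code point
  // they encode (this is what String.prototype.codePointAt does for you).
  function fromSurrogatePair(high, low) {
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
  }
  const smiley = "\uD83D\uDE00"; // one supplementary character, two UTF-16 code units
  console.log(fromSurrogatePair(smiley.charCodeAt(0), smiley.charCodeAt(1)).toString(16)); // "1f600"
  console.log(smiley.codePointAt(0).toString(16));                                         // "1f600"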

Hope that helps.

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.
