UTF-16 vs UTF-32
Phillips, Addison
addison at lab126.com
Mon May 16 20:43:38 PDT 2011
> > Personally, I think UTF16 is more prone to error than either UTF8 or
> > UTF32 -- in UTF32 there is a one-to-one correspondence
>
> One-to-one correspondence between string code units and Unicode codepoints.
>
> Unfortunately, "Unicode codepoint" is only a useful concept for some scripts...
> So you run into the same edge-case issues as UTF-16 does, but in somewhat
> fewer cases.
>
Not exactly. What is true is that, regardless of the Unicode encoding, a glyph on the screen may be composed of multiple Unicode characters, each of which may in turn be encoded as multiple code units.
I generally present this as:
Glyph == single visual unit of text
Character == single logical unit of text
Code point == integer value assigned to a single character (logical unit of text); generally it is better to refer to a "Unicode scalar value" here.
Code unit == single logical encoding unit of memory in a Unicode encoding form (where unit == byte, word, etc.)
So in UTF-8, the 'byte' is the code unit, which is used to encode "code points" [Unicode scalar values] (1 to 4 code units per code point). In UTF-16, the 'word' (16-bit unit) is the code unit, which is used to encode code points (1 to 2 code units per code point).
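For example (a quick sketch in modern ECMAScript -- code point escapes, codePointAt, and TextEncoder are assumed to be available, so treat it as illustrative rather than something every engine of the day will run):

  // U+1F600 is a supplementary character: one Unicode scalar value that
  // takes two UTF-16 code units and four UTF-8 code units to encode.
  const face = "\u{1F600}";
  console.log(face.length);                       // 2 -> UTF-16 code units
  console.log([...face].length);                  // 1 -> code points
  console.log(face.codePointAt(0).toString(16));  // "1f600"
  const utf8 = new TextEncoder().encode(face);    // UTF-8 encoding form
  console.log(utf8.length);                       // 4 -> UTF-8 code units (bytes)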
A glyph or "grapheme cluster" may require multiple characters to encode, so no part of Unicode can assume 1 glyph == 1 character == 1 code point == (x) code units.
The values in the surrogate range D800-DFFF *are* valid Unicode code points, but *not* valid Unicode scalar values. Isolated surrogate code points (not part of a surrogate pair in the UTF-16 encoding form) are never "well-formed", and surrogate code points are considered invalid in encoding forms other than UTF-16.
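A sketch of both points (assuming a recent engine; Intl.Segmenter and String.prototype.isWellFormed are newer additions and may not be present everywhere):

  // One glyph ("e" + U+0301 COMBINING ACUTE ACCENT) is one grapheme
  // cluster but two code points and two UTF-16 code units.
  const eAcute = "e\u0301";
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  console.log([...seg.segment(eAcute)].length);   // 1 -> grapheme clusters
  console.log([...eAcute].length);                // 2 -> code points

  // An isolated surrogate is a code point but not a scalar value, and a
  // string containing one is not well-formed UTF-16.
  const lone = "\uD83D";
  console.log(lone.isWellFormed());               // false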
As a result, "Unicode code point" is only a marginally useful concept when talking about specific character values in Unicode. A code point is not a code unit and must not be confused with one, and the term is slightly less precise than "Unicode scalar value". When talking about a supplementary character, for example, it is generally more useful to talk about the Unicode code point that the surrogate pair of UTF-16 code units encodes than about the Unicode code point of each surrogate.
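For example, the surrogate pair D83D DE00 encodes the single code point U+1F600, and that is the value worth talking about (a sketch; the arithmetic follows the UTF-16 decoding rule):

  const s = "\uD83D\uDE00";                       // surrogate pair
  console.log(s.charCodeAt(0).toString(16));      // "d83d" -> high surrogate code unit
  console.log(s.charCodeAt(1).toString(16));      // "de00" -> low surrogate code unit
  console.log(s.codePointAt(0).toString(16));     // "1f600" -> the encoded code point
  // The same value by hand, per the UTF-16 decoding rule:
  const hi = 0xD83D, lo = 0xDE00;
  console.log((((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000).toString(16)); // "1f600"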
Hope that helps.
Addison
Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)
Internationalization is not a feature.
It is an architecture.