UTF-16 vs UTF-32

Shawn Steele Shawn.Steele at microsoft.com
Mon May 16 18:29:53 PDT 2011


Allen said:
> One reason is that none of the built-in string methods understand surrogate pairs. If you want to do any string processing that recognizes such pairs, you have to either handle such pairs as multi-character sequences or do your own character-by-character processing.

(Which I think is false security, see my earlier comment below)

John said:
> Personally, I think UTF16 is more prone to error than either UTF8 or UTF32 -- in UTF32 there is a one-to-one correspondence, while in UTF8 it is obvious you have to deal with multi-byte encodings.  With UTF16, most developers only run into BMP characters and just assume that there is a one-to-one correspondence between chars and characters.  Then,  when their code runs into non-BMP characters they run into problems like restricting the size of a field to a number of chars and it is no longer long enough, etc.  The problems arise infrequently, which means many developers assume the problem doesn't exist.

The "developers only run into BMP characters and just assume (it works)" problem is basically a special case of the broader problem with multiple-code-point sequences.  Most developers also assume that there's a 1:1 relationship between code points and "glyphs".  If I want to restrict a field to a number of characters, then it ends up being too short not only for surrogate pairs, but also for scripts that rely heavily on combining sequences.  In fact, the latter is probably even more likely, since most of the characters that need surrogates aren't that common.  Once your code can handle that, surrogates are a very small part of the problem.
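
To make the field-length point concrete, here is a minimal sketch that counts the same text three ways. The helper names are hypothetical, the syntax (\u{...} escapes, spread, \p{M}) assumes a later ECMAScript edition than the one under discussion, and the cluster count is only a rough approximation (a base code point plus any following combining marks), not full Unicode text segmentation.

```javascript
// Hypothetical helpers, for illustration only.

// Number of UTF-16 code units (what str.length reports).
function countCodeUnits(str) {
  return str.length;
}

// Number of code points: string iteration yields whole code points,
// so a surrogate pair is counted once.
function countCodePoints(str) {
  return [...str].length;
}

// Rough approximation of user-perceived characters: one non-mark code
// point followed by any combining marks, or a stray run of marks.
// This is an assumption made for brevity, not real grapheme segmentation.
function countApproxClusters(str) {
  const clusters = str.match(/\P{M}\p{M}*|\p{M}+/gu);
  return clusters === null ? 0 : clusters.length;
}

const samples = {
  bmp: 'A',
  surrogatePair: '\u{1D11E}', // MUSICAL SYMBOL G CLEF, outside the BMP
  combining: 'A\u0308'        // A followed by COMBINING DIAERESIS
};

for (const name of Object.keys(samples)) {
  const s = samples[name];
  console.log(name, countCodeUnits(s), countCodePoints(s),
              countApproxClusters(s));
}
```

Under these assumptions the G clef counts as 2 code units but 1 code point, while A plus a combining diaeresis counts as 2 code units and 2 code points but only 1 approximate cluster. A field limit expressed in code points still comes up short for the second case.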

- Shawn

-----Original Message-----
From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Shawn Steele
Sent: Monday, May 16, 2011 6:08 PM
To: mikesamuel at gmail.com
Cc: es-discuss
Subject: RE: UTF-16 vs UTF-32

And why do you care to iterate by code unit?  I mean, sure it seems useful, but how useful?

Do you want to stop in the middle of Ä?  You probably don't stop in the middle of Ä.

I have no problem with regular expressions or APIs that use 32bit (well, 21bit) values, but I think existing apps already have a preconceived notion of the index in the string that it'd be bad to break.  Yes that means += 2 instead of ++ to get past a surrogate pair, but that happens to Ä as well.
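
The "+= 2 instead of ++" idea can be sketched as below. This is only an illustration under stated assumptions: nextStop and its helpers are hypothetical names, indices stay in UTF-16 code units so existing index arithmetic keeps working, and "combining mark" is approximated by the Unicode Mark category via \p{M}, which needs a later ECMAScript edition than the one being discussed.

```javascript
// Hypothetical helpers; all indices are in UTF-16 code units.

function isHighSurrogate(unit) {
  return unit >= 0xd800 && unit < 0xdc00;
}

function isLowSurrogate(unit) {
  return unit >= 0xdc00 && unit < 0xe000;
}

// Width in code units (1 or 2) of the code point starting at index i.
function codePointWidth(str, i) {
  if (isHighSurrogate(str.charCodeAt(i)) &&
      i + 1 < str.length &&
      isLowSurrogate(str.charCodeAt(i + 1))) {
    return 2;
  }
  return 1;
}

// True if the code point starting at i is in the Mark category.
function isCombiningMark(str, i) {
  return /^\p{M}$/u.test(str.slice(i, i + codePointWidth(str, i)));
}

// Index of the next place it is reasonable to stop: past the code point
// at i (++ or += 2) and past any combining marks that follow it.
function nextStop(str, i) {
  i += codePointWidth(str, i);
  while (i < str.length && isCombiningMark(str, i)) {
    i += codePointWidth(str, i);
  }
  return i;
}

console.log(nextStop('ab', 0));             // plain BMP character
console.log(nextStop('\uD834\uDD1Eb', 0));  // surrogate pair
console.log(nextStop('A\u0308b', 0));       // A + combining diaeresis
```

In this sketch the surrogate pair and the A-plus-diaeresis sequence both advance the index by 2, which is the sense in which the two cases look alike to code that walks the string by index.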

-Shawn

-----Original Message-----
From: Mike Samuel [mailto:mikesamuel at gmail.com] 
Sent: Monday, May 16, 2011 6:03 PM
To: Shawn Steele
Cc: Allen Wirfs-Brock; es-discuss
Subject: Re: UTF-16 vs UTF-32

2011/5/16 Shawn Steele <Shawn.Steele at microsoft.com>:
> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally.  (Sure, it'd be nice for conversion types).

I don't think anyone says that UTF-32 is desirable *internally*.
We're talking about the API of the string type.

I have been operating under the assumption that developers would benefit from a simple way to efficiently iterate by code unit.  What I have in mind is an efficient version of the function below, ideally one that just works like the current idiom: for (var i = 0, n = str.length; i < n; ++i) ...str[i]...

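    // Calls fn(value, i, index) once per code point, treating a surrogate
    // pair as one. i is the index of the first code unit; index counts
    // the calls made so far.
    function iterateByCodeUnit(str, fn) {
      str += '';  // coerce to string
      for (var i = 0, n = str.length, index = 0; i < n; ++i, ++index) {
        var unit = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate is one code point.
        if (0xd800 <= unit && unit < 0xdc00 && i + 1 < n) {
          var next = str.charCodeAt(i + 1);
          if (0xdc00 <= next && next < 0xe000) {
            // Combine the two 10-bit halves and add the 0x10000 offset.
            fn(0x10000 + (((unit & 0x3ff) << 10) | (next & 0x3ff)), i, index);
            ++i;  // skip the low surrogate
            continue;
          }
        }
        fn(unit, i, index);  // BMP code unit or unpaired surrogate
      }
    }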
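
For comparison, later editions of ECMAScript (ES2015 onward) added String.prototype.codePointAt and made string iteration proceed by code point, which covers much of what a hand-written loop like the one above has to do. A minimal sketch, with forEachCodePoint as a hypothetical name and the same (value, code unit index, running count) callback shape assumed:

```javascript
// Hypothetical helper: calls fn(codePoint, unitIndex, pointIndex) once per
// code point, where unitIndex is the UTF-16 index of its first code unit.
function forEachCodePoint(str, fn) {
  str = String(str);
  let unitIndex = 0;
  let pointIndex = 0;
  for (const ch of str) {      // for...of walks the string by code point
    fn(ch.codePointAt(0), unitIndex, pointIndex);
    unitIndex += ch.length;    // 1 for a BMP character, 2 for a pair
    pointIndex += 1;
  }
}

// Usage: print each code point in hex with both of its indices.
forEachCodePoint('a\u{1D11E}b', function (cp, unitIndex, pointIndex) {
  console.log(cp.toString(16), unitIndex, pointIndex);
});
```

An unpaired surrogate is passed through as its own value here, since string iteration yields it as a single element.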
