UTF-16 vs UTF-32

Shawn Steele Shawn.Steele at microsoft.com
Mon May 16 18:07:50 PDT 2011


And why do you care to iterate by code point?  I mean, sure, it seems useful, but how useful?

Do you want to stop in the middle of Ä?  You probably don't stop in the middle of Ä.

I have no problem with regular expressions or APIs that use 32-bit (well, 21-bit) values, but I think existing apps already have a preconceived notion of the index in the string that it'd be bad to break.  Yes, that means += 2 instead of ++ to get past a surrogate pair, but that happens to Ä as well.
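
A minimal sketch of the point above, assuming Ä is stored in decomposed form (A followed by U+0308 COMBINING DIAERESIS). The variable names and the helper stepPastCodePoint are hypothetical, made up for illustration and not taken from this thread. It shows that both a surrogate pair and a decomposed Ä occupy two UTF-16 code units, and that a surrogate-aware step handles only the first case.

```javascript
// Illustrative only; names here are assumptions, not part of any API.
var astral = '\ud835\udc00';   // U+1D400, one code point stored as a surrogate pair
var decomposed = 'A\u0308';    // "Ä" as A + U+0308 COMBINING DIAERESIS

// Both strings are two UTF-16 code units long, so both need += 2 to step past.
var lengths = [astral.length, decomposed.length];

// Hypothetical helper: number of code units to advance from index i so that
// a surrogate pair is never split. It knows nothing about combining marks.
function stepPastCodePoint(str, i) {
  var unit = str.charCodeAt(i);
  var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
  var isPair = 0xd800 <= unit && unit < 0xdc00 &&
               0xdc00 <= next && next < 0xe000;
  return isPair ? 2 : 1;
}
```

Stepping by code point therefore still lands in the middle of a decomposed Ä; avoiding that would take knowledge of combining sequences, which is a separate question from the choice between UTF-16 and UTF-32.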

-Shawn

-----Original Message-----
From: Mike Samuel [mailto:mikesamuel at gmail.com] 
Sent: Monday, May 16, 2011 6:03 PM
To: Shawn Steele
Cc: Allen Wirfs-Brock; es-discuss
Subject: Re: UTF-16 vs UTF-32

2011/5/16 Shawn Steele <Shawn.Steele at microsoft.com>:
> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally.  (Sure, it'd be nice for conversion types).

I don't think anyone says that UTF-32 is desirable *internally*.
We're talking about the API of the string type.

I have been operating under the assumption that developers would benefit from a simple way to efficiently iterate by code point.  An efficient version of the below, ideally one that just works like the current for (var i = 0, n = str.length; i < n; ++i) ...str[i]...

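    function iterateByCodePoint(str, fn) {
      str += '';  // coerce to string
      // i counts UTF-16 code units; index counts code points.
      for (var i = 0, n = str.length, index = 0; i < n; ++i, ++index) {
        var unit = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate encodes one code point.
        if (0xd800 <= unit && unit < 0xdc00 && i + 1 < n) {
          var next = str.charCodeAt(i + 1);
          if (0xdc00 <= next && next < 0xe000) {
            // Supplementary code points start at 0x10000.
            fn(0x10000 + (((unit & 0x3ff) << 10) | (next & 0x3ff)), i, index);
            ++i;  // skip the low surrogate
            continue;
          }
        }
        fn(unit, i, index);  // BMP code unit or unpaired surrogate
      }
    }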
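
To make the two positions passed to the callback concrete, here is a self-contained sketch that records, for each code point, its value, its code unit offset, and its code point count. The helper listCodePoints and the field names are hypothetical, chosen for illustration only; the sketch assumes the usual UTF-16 decoding, in which a combined surrogate pair is offset by 0x10000.

```javascript
// Hypothetical helper, not from the thread: returns one record per code point.
function listCodePoints(str) {
  var out = [];
  for (var i = 0, n = str.length, index = 0; i < n; ++i, ++index) {
    var start = i;
    var cp = str.charCodeAt(i);
    if (0xd800 <= cp && cp < 0xdc00 && i + 1 < n) {
      var next = str.charCodeAt(i + 1);
      if (0xdc00 <= next && next < 0xe000) {
        // Combine the pair; the 0x10000 offset maps it into U+10000..U+10FFFF.
        cp = 0x10000 + (((cp & 0x3ff) << 10) | (next & 0x3ff));
        ++i;  // skip the low surrogate
      }
    }
    out.push({ codePoint: cp, unitIndex: start, pointIndex: index });
  }
  return out;
}

// 'a', then U+1D400 as a surrogate pair, then 'b'.
var records = listCodePoints('a\ud835\udc00b');
// records[1] is { codePoint: 0x1d400, unitIndex: 1, pointIndex: 1 }
// records[2] is { codePoint: 0x62,    unitIndex: 3, pointIndex: 2 }
```

After the surrogate pair the two indices diverge: 'b' sits at code unit offset 3 but is only the third code point. Existing code that stores string offsets is relying on the first of those two numbers.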


