UTF-16 vs UTF-32
Shawn Steele
Shawn.Steele at microsoft.com
Mon May 16 18:07:50 PDT 2011
And why do you care to iterate by code unit? I mean, sure it seems useful, but how useful?
Do you want to stop in the middle of Ä? You probably don't stop in the middle of Ä.
I have no problem with regular expressions or APIs that use 32-bit (well, 21-bit) values, but I think existing apps already have a preconceived notion of the index in the string, and it'd be bad to break that. Yes, that means += 2 instead of ++ to get past a surrogate pair, but that happens to Ä as well.
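A small sketch of the indexing point above, assuming the Ä in question is the decomposed form (A followed by U+0308 COMBINING DIAERESIS) rather than the precomposed U+00C4; the variable names are illustrative only. Both strings below display as one character yet span two UTF-16 code units, so an index has to advance by 2 to get past either one.

```javascript
// Assumption: "Ä" is the decomposed sequence A + U+0308, not precomposed U+00C4.
var decomposed = 'A\u0308';
// U+1D49C (MATHEMATICAL SCRIPT CAPITAL A), stored as a surrogate pair.
var astral = '\uD835\uDC9C';

// Both occupy two UTF-16 code units.
var lengths = [decomposed.length, astral.length];

// Stepping with ++ instead of += 2 lands in the middle of each one.
var middleOfDecomposed = decomposed.charCodeAt(1); // the combining mark
var middleOfAstral = astral.charCodeAt(1);         // the trail surrogate
```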
-Shawn
-----Original Message-----
From: Mike Samuel [mailto:mikesamuel at gmail.com]
Sent: Monday, May 16, 2011 6:03 PM
To: Shawn Steele
Cc: Allen Wirfs-Brock; es-discuss
Subject: Re: UTF-16 vs UTF-32
2011/5/16 Shawn Steele <Shawn.Steele at microsoft.com>:
> It's clear why we want to support the full Unicode range, but it's less clear to me why UTF-32 would be desirable internally. (Sure, it'd be nice for conversion types).
I don't think anyone says that UTF-32 is desirable *internally*.
We're talking about the API of the string type.
I have been operating under the assumption that developers would benefit from a simple way to efficiently iterate by code unit. What I have in mind is an efficient version of the below, ideally one that just works like the current for (var i = 0, n = str.length; i < n; ++i) ...str[i]... loop:
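// Despite the name, fn receives a full code point whenever a valid
// surrogate pair is found, plus the code-unit offset i and a running index.
function iterateByCodeUnit(str, fn) {
  str += '';  // coerce to string
  // index must start at 0; left uninitialized, ++index would yield NaN.
  for (var i = 0, n = str.length, index = 0; i < n; ++i, ++index) {
    var unit = str.charCodeAt(i);
    // Lead (high) surrogate with at least one more code unit after it?
    if (0xd800 <= unit && unit < 0xdc00 && i + 1 < n) {
      var next = str.charCodeAt(i + 1);
      // Trail (low) surrogate: combine the pair into one code point.
      if (0xdc00 <= next && next < 0xe000) {
        // Supplementary code points start at 0x10000, so add that offset.
        fn(0x10000 + (((unit & 0x3ff) << 10) | (next & 0x3ff)), i, index);
        ++i;  // skip the trail surrogate
        continue;
      }
    }
    // BMP code unit or unpaired surrogate: pass through unchanged.
    fn(unit, i, index);
  }
}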
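As a rough usage sketch of the loop above: the helper below (the name codePointsOf is hypothetical, not from this thread) applies the same surrogate-pair test and collects each value together with its code-unit offset. It assumes the 0x10000 offset is added when a pair is combined, as UTF-16 decoding requires.

```javascript
// Hypothetical helper: returns [value, codeUnitOffset] pairs for str.
function codePointsOf(str) {
  str += '';
  var out = [];
  for (var i = 0, n = str.length; i < n; ++i) {
    var unit = str.charCodeAt(i);
    if (0xd800 <= unit && unit < 0xdc00 && i + 1 < n) {
      var next = str.charCodeAt(i + 1);
      if (0xdc00 <= next && next < 0xe000) {
        // Valid pair: emit one supplementary code point, skip the trail unit.
        out.push([0x10000 + (((unit & 0x3ff) << 10) | (next & 0x3ff)), i]);
        ++i;
        continue;
      }
    }
    // BMP code unit or unpaired surrogate.
    out.push([unit, i]);
  }
  return out;
}

// 'a', then U+1D49C as a surrogate pair, then 'b'.
var result = codePointsOf('a\uD835\uDC9Cb');
```

The offsets in the result are 0, 1 and 3: the existing code-unit indices are preserved, and the pair simply consumes two of them.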