Full Unicode strings strawman

Brendan Eich brendan at mozilla.com
Thu May 19 09:50:05 PDT 2011


On May 18, 2011, at 3:46 PM, Waldemar Horwat wrote:

> On 05/16/11 11:11, Allen Wirfs-Brock wrote:
>> I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.
>> 
>> Feedback would be appreciated:
>> 
>> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings
>> 
>> Allen
> 
> Two different languages made different decisions on how to approach extending their character sets as Unicode evolved:
> 
> - Java kept their strings encoded exactly as they were (a sequence of 16-bit code units) and provided extra APIs for the cases where you want to extract a code point.

Bloaty.


> - Perl widened the concept of characters in strings away from bytes to full Unicode characters.  Thus a UTF-8 string can be either represented where each byte is one Perl character or where each Unicode character is one Perl character.  There are conversion functions provided to move between the two.

This is analogous but different in degree. Going from bytes to Unicode characters widens each element by ~13 bits (8 to 21); going from uint16 to Unicode widens it by only 5 (16 to 21).


> My experience is that Java's approach worked, while Perl's has led to an endless shop of horrors.  The problem is that different APIs expect different kinds of strings, so I'm still finding places where conversions should be added but weren't (or vice versa) in a lot of code years after it was written.

Won't all the relevant browser APIs expect DOMString, which will be the same as JS string? The encoding or lack of it for unformatted 16-bit data is up to the caller and callee. No conversion functions required.

The significant new APIs in Allen's proposal (apart from transcoding helpers that might be useful already, where JS hackers wrongly assume BMP only) are for generating strings from full Unicode code points: String.fromCode, etc.
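
To make "generating strings from full Unicode code points" concrete, here is a minimal sketch of what such a helper has to do on top of today's 16-bit strings. The name fromCodePointAsPairs is hypothetical and is not the strawman's String.fromCode; it only shows the surrogate-pair arithmetic that developers otherwise write by hand.

```javascript
// Hypothetical helper (not from the strawman): build a string for one
// Unicode code point using today's 16-bit code units.
function fromCodePointAsPairs(cp) {
  if (typeof cp !== "number" || Math.floor(cp) !== cp || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("Invalid code point: " + cp);
  }
  if (cp <= 0xFFFF) {
    // BMP code points fit in a single code unit.
    return String.fromCharCode(cp);
  }
  // Non-BMP code points are split into a high and a low surrogate.
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

// U+1D11E MUSICAL SYMBOL G CLEF becomes the pair D834 DD1E.
const clef = fromCodePointAsPairs(0x1D11E);
```

Under the proposal, as I read it, the result would be a one-element string rather than a two-element one.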


> 1. I would not be in favor of any approach that widens the concept of a string character or introduces two different representations for a non-BMP character.  It will suffer from the same problems as Perl, except that they will be harder to find because use of non-BMP characters is relatively rare.

We have the "non-BMP-chars encoded as pairs" problem already. The proposal does not increase its incidence; it merely adds ways to make strings that don't have this problem.
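
For anyone who has not run into the pairs problem, a small illustration of current behavior (the variable names are just for this example):

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
const gClef = "\ud834\udd1e";

// length and charCodeAt see 16-bit code units, not characters.
const unitCount = gClef.length;
const firstUnit = gClef.charCodeAt(0);   // high surrogate
const secondUnit = gClef.charCodeAt(1);  // low surrogate

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16));
```

One user-visible character, two string elements, neither of which is a character by itself.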

The other problem, of mixing strings-with-pairs and strings-without, is real. But it is not obvious to me why it's worse than duplicating all the existing string APIs. Developers still have to choose. Without a new "ustring" type they can still mix. Do duplicated full-Unicode APIs really pay their way?


> 2. Widening characters to 21 bits doesn't really help much.  As stated earlier in this thread, you still want to treat clumps of combining characters together with the character to which they combine, worry about various normalized forms, etc.

Do you? I mean: do web developers?

You can use non-BMP characters in HTML today, pump them through JS, back into the DOM, and render them beautifully on all the latest browsers, IINM. They go as pairs, but the JS code does not care and it can't, unless it hardcodes indexes and lengths, or does something evil like s.indexOf("\ud800") or whatever.
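
A rough sketch of the difference between code that treats the string as opaque and code that hardcodes lengths. The sample text and variable names are made up for illustration.

```javascript
// "a", then U+1D11E as a surrogate pair, then "b".
const text = "a" + "\ud834\udd1e" + "b";

// Opaque handling: concatenating and comparing never looks inside the pair,
// so the non-BMP character survives intact.
const wrapped = "<p>" + text + "</p>";
const unwrapped = wrapped.slice(3, wrapped.length - 4);
const survived = unwrapped === text;

// Hardcoded length: "keep the first two characters" cuts the pair in half
// and leaves a lone high surrogate at the end.
const truncated = text.substring(0, 2);
const lastUnit = truncated.charCodeAt(truncated.length - 1);
const endsWithLoneSurrogate = lastUnit >= 0xD800 && lastUnit <= 0xDBFF;
```

The first half is what most web code does; the second half is the kind of code that notices pairs, and gets them wrong.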


> All of these require the machinery to deal with clumps of code units as though they were single characters/graphemes/etc., and once you have that machinery you can reuse it to support non-BMP characters while keeping string code units 16 bits.

How do you support non-BMP characters without API bloat? Too many APIs by themselves will simply cause developers to stick to the old APIs when they should use the new ones.

The crucial win of Allen's proposal comes down the road, when someone in a certain locale *can* do s.indexOf(nonBMPChar) and win. That is what Unicode promises and JS fails to deliver. That seems worth considering, rather than s.wideIndexOf(nonBMPChar).
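
To show what the duplicated-API route costs, here is a sketch of what something like wideIndexOf might have to do over 16-bit strings. The function is hypothetical, and its semantics (returning an index counted in code points) are my assumption, not anything specified in the strawman or in Waldemar's message.

```javascript
// Hypothetical: position of `needle` in `haystack`, counted in code points
// rather than 16-bit code units. Returns -1 when there is no match.
// Sketch only: it does not reject matches that start inside a pair.
function wideIndexOf(haystack, needle) {
  const unitIndex = haystack.indexOf(needle);
  if (unitIndex < 0) {
    return -1;
  }
  let codePoints = 0;
  for (let i = 0; i < unitIndex; i++) {
    const unit = haystack.charCodeAt(i);
    const next = haystack.charCodeAt(i + 1); // NaN past the end
    const isPair =
      unit >= 0xD800 && unit <= 0xDBFF &&
      next >= 0xDC00 && next <= 0xDFFF;
    if (isPair) {
      i++; // skip the low surrogate; the pair counts once
    }
    codePoints++;
  }
  return codePoints;
}

const sample = "\ud834\udd1eab";
const byUnits = sample.indexOf("b");           // counts the pair as two
const byCodePoints = wideIndexOf(sample, "b"); // counts the pair as one
```

Every index-taking or index-returning string method would need a twin like this, which is the bloat I am worried about.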

/be


