Full Unicode strings strawman

Gillam, Richard gillam at lab126.com
Mon May 16 15:24:59 PDT 2011


I tried to post a pointer to this strawman on this list a few weeks ago, but apparently it didn't reach the list for some reason.

Feedback would be appreciated:


I was actually on the committee when the language you're proposing to change was adopted and, in fact, I think I actually proposed that wording.

The intent behind the original wording was to extend the standard back then in ES3 to allow the use of the full range of Unicode characters, and to do it in more or less the same way that Java had done it: While the actual choice of an internal string representation would be left up to the implementer, all public interfaces (where it made a difference) would behave exactly as if the internal representation were UTF-16.  In particular, you would represent supplementary-plane characters with two \u escape sequences representing a surrogate pair, and interfaces that assigned numeric indexes to characters in strings would do so based on the UTF-16 representation of the string-- a supplementary-plane character would take up two character positions in the string.
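To make the UTF-16-based behavior concrete, here is a small sketch using U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary supplementary-plane character. The variable names are only for illustration.

```javascript
// U+1D11E written as a surrogate pair with two \u escapes.
var clef = "\uD834\uDD1E";

// The string holds one character but two UTF-16 code units.
var clefLength = clef.length;

// Each index addresses one code unit, not one character.
var leadUnit = clef.charCodeAt(0);   // lead (high) surrogate, 0xD834
var trailUnit = clef.charCodeAt(1);  // trail (low) surrogate, 0xDD1E

// Anything after the supplementary character is shifted by two positions.
var padded = "a" + clef + "b";
var indexOfB = padded.indexOf("b");
```

Here clefLength is 2 and indexOfB is 3, even though "b" is only the third character of the string.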

I don't have a problem with introducing a new escaping syntax for supplementary-plane characters, but I really don't think you want to go messing with string indexing.  It'd be a breaking change for existing implementations.  I don't think it actually matters if the existing implementation "supported" UTF-16 or not-- if you read string content that included surrogate pairs from some external source, I doubt anything in the JavaScript implementation was filtering out the surrogate pairs because the implementation "only supported UCS-2".  And most things would have worked fine.  But the characters would be numbered according to their UTF-16 representation.

If you want to introduce new APIs that index things according to the UTF-32 representation, that'd be okay, but it's more of a burden for implementations that use UTF-16 for their internal representation, and we optimized for that on the assumption that it was the most common choice.
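As a rough illustration of that burden: when the internal representation is UTF-16, finding the n-th code point means scanning from the start of the string rather than jumping straight to an offset. The helper below is hypothetical (the name codeUnitOffsetOfCodePoint is mine, not a proposed API) and assumes unpaired surrogates count as one code point each.

```javascript
// Hypothetical helper: return the code unit offset at which the n-th
// code point (0-based) begins. Returns str.length if n is past the end.
function codeUnitOffsetOfCodePoint(str, n) {
  var i = 0;
  while (n > 0 && i < str.length) {
    var unit = str.charCodeAt(i);
    var isLead = unit >= 0xD800 && unit <= 0xDBFF;
    var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    var isPair = isLead && next >= 0xDC00 && next <= 0xDFFF;
    // A well-formed surrogate pair is one code point but two code units.
    i += isPair ? 2 : 1;
    n--;
  }
  return i;
}

var sample = "a\uD834\uDD1Eb";
var offsetOfThird = codeUnitOffsetOfCodePoint(sample, 2);
```

For the sample string, the third code point ("b") starts at code unit offset 3, and getting there takes a linear walk over everything before it.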

Defining String.fromCharCode() to build a string based on an abstract Unicode code point value might be okay (although it might be better to make that a new function), but when presented with a code point value above 0xFFFF, it'd produce a string of length 2-- the length of the UTF-16 representation.  String.charCodeAt() was always defined, and should continue to be defined, based on the UTF-16 representation.  If you want to introduce a new API based on the UTF-32 representation, fine.
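A minimal sketch of what such new functions might look like when layered on the existing UTF-16-based ones. Both names (stringFromCodePoint, codePointAtIndex) are hypothetical and not part of any proposal, and the sketch assumes the caller passes an integer in the range 0 to 0x10FFFF.

```javascript
// Hypothetical: build a string from one abstract code point value.
function stringFromCodePoint(cp) {
  if (cp <= 0xFFFF) {
    return String.fromCharCode(cp);
  }
  // Supplementary plane: split into a UTF-16 surrogate pair.
  var offset = cp - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10),
                             0xDC00 + (offset & 0x3FF));
}

// Hypothetical: read the code point that starts at code unit index i.
// Indexing is still by UTF-16 code unit, as with charCodeAt().
function codePointAtIndex(str, i) {
  var first = str.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < str.length) {
    var second = str.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }
  }
  // BMP character or unpaired surrogate: return the code unit itself.
  return first;
}

var built = stringFromCodePoint(0x1D11E);
var builtLength = built.length;
var roundTrip = codePointAtIndex(built, 0);
```

With this sketch, builtLength is 2 and built.charCodeAt(0) still reports the lead surrogate 0xD834, so existing callers see no change, while roundTrip recovers 0x1D11E.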

I'd also recommend against flogging the 21-bit thing so heavily-- the 21-bit thing is sort of an accident of history, and not all 21-bit values are legal Unicode code point values either.  I'd use "32" for the longer forms of things.
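To illustrate the point about 21 bits: the largest 21-bit value is 0x1FFFFF, but Unicode code points stop at 0x10FFFF. A hypothetical range check (the function name is mine) might look like this:

```javascript
var MAX_21_BIT_VALUE = 0x1FFFFF;  // largest value that fits in 21 bits
var MAX_CODE_POINT = 0x10FFFF;    // largest Unicode code point

// Hypothetical helper: true if v is an integer in the code point range.
function isUnicodeCodePoint(v) {
  return typeof v === "number" &&
         Math.floor(v) === v &&
         v >= 0 &&
         v <= MAX_CODE_POINT;
}

var fitsButInvalid = isUnicodeCodePoint(MAX_21_BIT_VALUE);
var lastValid = isUnicodeCodePoint(MAX_CODE_POINT);
```

Here fitsButInvalid is false and lastValid is true, so fitting in 21 bits is not enough to make a value a code point.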

I think it's fine to have everything work in terms of abstract Unicode code points, but I don't think you can ignore the backward-compatibility issues with character indexing in the current API.

--Rich Gillam
